You are listening to the HumAIn podcast. HumAIn is your first look at the startups and industry titans that are leading and disrupting artificial intelligence, data science, the future of work, and developer education. I am your host, David Yakobovitch, and you’re listening to HumAIn. If you like this episode, remember to subscribe and leave a review. Now, on to the show.
Listeners, welcome back to the HumAIn podcast. Today, I have a special guest on our show, Hefin Rhys, who works at the Francis Crick Institute in the United Kingdom. He’s a researcher and scientist, and also a new author of a machine learning book with Manning Publications called Machine Learning with R, the tidyverse, and mlr, which goes live in just a couple of weeks. Hefin, thank you so much for being with us on HumAIn.
Thank you very much for having me. It’s nice to be here. Thank you. Hello, everyone.
This is so fun, because there’s so much going on in the world. Offline, we were just speaking before the episode about coronavirus, and that is not a topic that’s fun. It’s a topic that is very serious. It’s as if it’s the only topic the world has been talking about for the past two months now.
You’re a researcher traditionally, and you actually worked a lot in biology and medical systems. I wanted to hear your take on what’s going on with the Coronavirus around research.
Sure. I want to caveat this by saying that I’m not a virologist or an epidemiologist. So my view is not that of someone who is an expert on this virus, but it’s clearly something that’s very serious and that we need to take seriously and treat with respect.
But having said that, because it’s new and was likely transmitted from a bat, it makes for a very cool, sort of sexy, mysterious, and worrying news story. So as much as the virulence of the virus itself is concerning, what particularly concerns me is how viral misinformation and misinformed practices have spread along with it.
So, for example, health bodies around the world are basically saying that wearing face masks is very unlikely to prevent you from contracting COVID-19. And actually, I also heard somebody suggest, or a study that suggested, that wearing face masks could even increase your likelihood of contracting it.
And then there are the panic measures that people go through, buying face masks or panic-stocking food and supplies, because they think this is going to be some sort of T-virus, or have such an extreme impact on the population.
It does, obviously, kill people, and people have sadly died from it. Those who have died have largely been older people, people with underlying health conditions, or people who are immunocompromised. But for the vast majority of people who are healthy and have strong, functioning immune systems, if you contract COVID-19, you’re going to feel pretty rubbish. You’ll get symptoms very similar to the flu, but it’s very unlikely to kill you.
And this is the other thing: we really need to ground ourselves and remember the strains of influenza that we already know about and that already live within our communities. You have colleagues, friends, and family who get the flu. Those strains are much more prevalent and, in fact, many more people die from them. One of the reasons we’re talking about COVID-19 so much, and why it is such a big deal, is that it is particularly virulent.
It is easily transmissible, more so than influenza. But it’s important, for the sake of people’s health as much as anything, to remain grounded, not panic, stay calm, and monitor the disease, and to implement practices that actually stop the spread. So rather than buying up face masks, which takes face masks away from the medical professionals who actually need them (as we’ve seen with shortages in the UK), just implement good hand-washing technique.
Hand washing is the most effective way to prevent the transmission of COVID-19. As far as research goes, I can’t comment an awful lot; here at the Francis Crick, it’s not actually a project that I’m aware of us researching, but I know a number of institutes and hospitals around the world are looking for a vaccine.
But it’s a flu-like virus, so it’s not going to be something that’s simply cured; it may just be a case of letting it run its course until you feel better, and self-isolating, of course. If you feel like you have flu-like symptoms, even if it’s just a normal flu, don’t go to work. Don’t go to school. Self-isolate, prevent others from getting ill as well, and speak to your local healthcare providers and seek advice.
That’s my non-virologist take on the current situation. And actually, I’m going to complain again, I don’t know if people have seen it, about what is otherwise an absolutely beautiful dashboard. And I do love a lovely dashboard, with the blacks and reds. It evokes a sort of Walking Dead feeling of pandemics and end-of-the-world vibes.
That’s right. I’ve been working with students lately on some capstone projects. We’ve been using both the Python and R programming languages. And one of the groups said they wanted to work on a coronavirus, COVID-19 project. And I said to them, all right.
I actually had an article that came out a few weeks ago on Medium, where I was talking about how to fight the coronavirus with AI and data science. By this point, many people might know that BlueDot is the big company out of Toronto; it’s the same company, actually, that made predictions for SARS and Ebola and other conditions.
But I said to this team that wanted to work in Python and R: what data is available? Here in 2020, everyone gets so excited when you go through and see, wow, Harvard’s working on this. Yes, people are wearing face masks. There’s so much you can do. Just like you mentioned, quarantine yourself to stave off a global pandemic. But it’s really not just that; it’s all about the data.
So you need to have data to extract insights. And when we look at the dashboards, whether from Johns Hopkins University, or the Ministry of Health in Singapore, which has a similar one, or even the Korean government, whose ministry of health recently made its data public, the data is so sparse; there is not that much information available.
And so the students that I work with say, I want to do all this great feature analysis. I want to do all these great visualizations. If Johns Hopkins can do it, I can do it too. And I say, yes, you can. And you need data. And that’s one of the big challenges we’re experiencing in 2020, which I’d love to dive deeper with you during our show.
As a new author, the work that you’re focusing on is in the R programming language, whether for healthcare, social scientists, or any type of researcher. I know that you focus generally on three major areas of R in your book: the tidyverse, tibble, and mlr.
When we look at these packages, they each do something unique, quite different in their own respect from Python, and they’re not all machine learning; there’s a lot more to the packages than that. Can you share with our listeners a little about your inspiration for writing this book, to start with, and then we’ll dive deeper into some of those topics?
Absolutely. So I feel like I should start by caveating or apologizing almost, because I am an outsider. I should really put that straight out there. I am not a computer scientist and I am not a machine learning researcher. So why have I got any business writing a book like this?
I work in life sciences. I studied as a pharmacologist trying to understand how drugs work, and then my PhD was in immunology. And basically, throughout my PhD, the questions that we were asking and the data and volume of data that we were generating started to mean that the traditional analysis methods that we had been using and that other people in biological fields were using started to not quite suit our needs, not quite answer our questions.
So, I started working with R, one of the reasons being that it was free and I was a poor student, learning it from the ground up. And of course I’m still learning, which is true of anyone who uses a programming language. Then I started to learn more about, understand, apply, and get valid answers from machine learning techniques applied to the data.
And the thing in the biological life sciences is that the level of math literacy tends not to be all that great, at least in academia. So if you are someone who knows how to code, who understands how to model things with statistics or apply machine learning, you become gold dust.
So I started to teach statistics, R, and machine learning during my PhD. And just for accessibility, so that people could watch again, or catch up if they missed my lectures, I’d record them and stick them on YouTube. So I had a YouTube channel with a few videos on different topics to do with data science, R, and statistics.
And then I got contacted by Manning; apparently publishers have these people who go around looking for candidates to write books for them, and one of them stumbled on my YouTube channel and asked me if I’d submit a book proposal. And I said, thank you, but there are people who are definitely more qualified than I am to write this book.
You probably want to contact the Hadley Wickhams of the world, the data scientists and machine learning engineers. And they said, well, there are some phenomenal books by those kinds of people, but what we really want is a book from the perspective of someone who has come to machine learning and AI, learned it, and applied it in their daily work. Because they really wanted a book that was not necessarily for computer scientists, but for people who are experts in their own area and who could use and benefit from machine learning.
So, to my slight embarrassment, the book changed names at some point; you’ll notice that the name you can see on your screen is no longer the final running title. I started writing the book with the view that it would be for people who are experts in their own field: maybe academics, researchers, journalists, economists. People who are experts in their own fields and who don’t necessarily want to, or can’t, or don’t have the time to become experts in something else.
Because becoming an expert in something takes years. But they could benefit from understanding and learning machine learning to make predictions and extract meaningful insights from the data that they have. And I kind of wrote it for myself 10 years ago, thinking about how stupid I was. I don’t assume the reader is stupid; I was stupid, for my level of math understanding and as someone who at that time had not much knowledge or experience of machine learning. So it’s a fun, sometimes tongue-in-cheek approach to machine learning. It assumes you have some basic knowledge of R, or, quite frankly, if you have basic Python skills, you’ll be able to pick up R very quickly.
So it’s for people who have some basic R or Python skills but who are new to machine learning, and it aims to make things simple and fun, and also to teach a modern approach to machine learning. I teach the tidyverse set of packages, which basically make your data science workflow nice and streamlined, allow you to do extremely complex data manipulation and transformation very easily, and let you create beautiful graphics with the ggplot2 package. And that then makes machine learning much easier.
For the machine learning, I use the mlr package, which gives you a really nice uniform interface to a huge variety of machine learning techniques and approaches. I’m as graphical as I can be, and I present the math as much as possible as a nice-to-know rather than a need-to-know. Because I’m not fantastic at maths at all; I’m a stereotypical biologist. So if your math skills are not phenomenal, that does not mean you won’t be able to dive into the book and start using these techniques.
I love everything that you’ve been sharing, Hefin, because, similar to yourself, I got started not as a traditional data scientist but with The Carpentries. This is an organization that’s very much focused on researchers in genomics, ecology, and the sciences, empowering scientific learning with code. In fact, one of the first workshops I delivered was all about R, with visualization using ggplot2 and then analysis, and it was amazing.
I was working on this back in 2014, and I said, oh my goodness, what have I been missing out on, using Microsoft Excel when there’s this R programming language? It completely rocked my world. Since then I’ve continued my learning journey, which has gone back and forth between R, SQL, Python, and other languages, but has moved very much towards Python. But this year, 2020, is very special: it’s a big comeback year for R, and I wanted to hear, why do you think that is?
R and Python have this very strange relationship. There is this weird rivalry over which is better: proponents of R say that R is better; proponents of Python say that Python is better. And it can be fun and tongue-in-cheek sometimes. But I actually think it can also hurt people’s learning, because they’ll hear from someone, “R is rubbish, only use Python,” and then they’re missing out on some phenomenal data science tools.
The answer to the question of whether somebody should learn R or Python is yes: people should use either, or both. Python is excellent, and if you were to look back at R a few years ago, Python would probably have been the more convenient choice for a lot of people, for machine learning, for example, or if they were going to end up deploying apps from the projects they were working on. One of the things Python would have been considered to win over R on for a long time is the phenomenal scikit-learn package, which is amazing and gives you a common interface to a huge number of machine learning tools and algorithms.
And the reason Python had won in that category comes down to CRAN and the way people contribute things to R: basically, every machine learning algorithm was implemented by a different person in a different package. These different packages and functions had different interfaces and different arguments, so every time you wanted to apply a new technique, you’d need to read the documentation and learn a new package all over again. Whereas in scikit-learn you had a single interface.
But then along come packages like caret and mlr in R, which were kind of an answer to scikit-learn. They create a common interface, so once you learn how to use the package, substituting in a variety of different machine learning techniques and algorithms is extremely simple. Caret is phenomenal, but my preference, and what I wrote about in my book, is mlr. Maybe we can talk a little about why later, but packages like caret and mlr are one of the reasons R is potentially making a comeback.
The other reason you can probably put down to Hadley fever: people like Hadley Wickham, the folks at RStudio, and the other contributors to the tidyverse packages. If anyone isn’t aware of what the tidyverse is, it’s a collection of data science packages designed to make common data science tasks extremely easy, clean, and reproducible. And of course, there’s nothing you can do in the tidyverse that you could not do using base R code.
But using the tidyverse makes your code much more readable, much easier, and much faster, in terms of typing anyway. So the tidyverse is a set of packages, and the core packages are: dplyr, which is for manipulating and transforming your data, so selecting columns, filtering rows, mutating new columns, things like that; the absolutely famous ggplot2 plotting library, which, and I’m a little bit biased, is in my opinion the best plotting library there is; and readr, which is phenomenal for reading in data, in a tidy format, as a table. So if you’re familiar with reading data into R, you’ll be familiar with the data frame structure.
Tibble is a package that creates a new data structure called the tibble, which just gets rid of a few of the features most data scientists dislike about data frames. For example, the whole strings-as-factors thing: when you create tibbles, you don’t have to worry about that, and the printing of your data is much nicer. Then there’s the tidyr package for converting between long and wide formats, and the purrr package, which is phenomenal: it allows you to vectorize your functions.
So for loops can be a thing of the past, basically; it’s extremely powerful. In my book, I dedicate a whole chapter to these tools, which will hopefully make people better and faster, and help them enjoy their data science projects more.
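To make the tidyverse workflow described above concrete, here is a minimal sketch in R. It is illustrative only, not from the episode: the built-in iris dataset and the derived `sepal_ratio` column are placeholders.

```r
# Illustrative only: a typical tidyverse pipeline using the built-in iris data.
library(tidyverse)

iris_summary <- iris %>%
  as_tibble() %>%                                        # tibbles print nicely; no strings-as-factors surprises
  filter(Sepal.Length > 5) %>%                           # dplyr: filter rows
  mutate(sepal_ratio = Sepal.Length / Sepal.Width) %>%   # dplyr: mutate a new column
  group_by(Species) %>%
  summarise(mean_ratio = mean(sepal_ratio))              # dplyr: summarise per group

# ggplot2: the plotting side of the tidyverse
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# purrr: vectorise over the numeric columns instead of writing a for loop
col_means <- map_dbl(iris[, 1:4], mean)
```

Each verb does one small job, which is what makes the pipeline readable compared with the equivalent base R code.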
Then, once people have those under their belt, we move on to the meaty machine learning chapters, where we use these tools over and over again so that people get a really good feel for them. There’s been a bit of a renaissance in R, with all these packages coming out as part of the tidyverse; R has become very sexy and fashionable again among data scientists. The traditional differences between fields, as to whether they use R or Python, have hung around, though. For me, in academic biology it’s mostly R, at least here in the UK, because R has always had that mission of being specifically a statistical language; it’s very much geared towards that.
So people who have been doing, maybe, basic linear modeling have found that R fits the bill, so that’s what they’ve learned and that’s what they’ve gone with. People looking to train neural nets and do deep learning have traditionally leaned towards Python, as a lot of the deep learning libraries were interfaced from Python first. But now you can interface with Keras, TensorFlow, and H2O from within R as well.
So there isn’t really anything you can do in one that you can’t do in the other. Shiny now allows people to create web apps and dashboards very easily using R. So the tidyverse, Shiny, which I haven’t really mentioned, and these unified machine learning interfaces have played a big role in making R cool again, because some had predicted it would fall by the wayside.
It’s amazing to think that Python today has over 200,000 projects and has become so big, with so many developers, but it’s become scattered. R today has less than 10% of the number of packages Python has, but there’s such a unified mission. That’s what I’m hearing from you, too: it’s all these researchers and scientists collaborating together.
For someone like myself, who’s more focused on Python, these packages provided in the tidyverse sound like the answer to Matplotlib and pandas and NumPy, and a lot of the visualization and analysis packages that researchers in Python took for granted. But as you mentioned, Hefin, that wasn’t always available for those solving problems in R. There were great statistical packages, but now it’s a complete solution.
Absolutely. There’s basically no reason for Python and R to compete anymore, really. People should use whatever they feel more comfortable using, depending on what their colleagues and the people in their field use; and once you get confident in one, learn the other. My Python skills lag far behind my R skills, but I’m in the process of trying to learn Python, because there are useful things that both can do.
And of course, we can interface with Python from within R and RStudio, for example, so we can incorporate code from both languages. One of the things that put me off using Python for some of my projects is that I wasn’t a big fan of Matplotlib, but now you can even interface with ggplot2 from within Python. So there’s really no excuse.
It’s so amazing how interoperable the languages are getting. All the latest visualization and machine learning packages are basically saying, whether you want R or Python, we are here to help you be successful. And one of the big, state-of-the-art ones recently is Plotly, which has been pretty big in the Python space but has full interoperability with R as well.
And even beyond that, as you mentioned, in notebooks or script files you can write code that is interoperable between Python and R. There are a few packages for this, but one that’s really nice is reticulate, from RStudio. So it’s pretty cool to see that developers now have the option to code with either or both, without necessarily the concern of being pigeonholed into just one ecosystem.
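As a small illustration of the reticulate interop mentioned above (not from the episode, and it assumes a Python installation with NumPy available on the machine):

```r
# Illustrative only: calling Python from R via reticulate.
# Assumes Python and the numpy module are installed and discoverable.
library(reticulate)

np <- import("numpy")            # import a Python module as an R object
x <- np$array(c(1, 2, 3, 4))     # build a numpy array from an R vector
m <- np$mean(x)                  # call numpy.mean from R; returns an R numeric
```

Python objects are wrapped so that `$` accesses their attributes and methods, which makes mixing the two languages in one script feel natural.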
Absolutely. One thing as well that is very nice about using R, and I know less about this for Python, so maybe I can get you to comment on it, is that R has a phenomenal community. You need only tweet a question or ask for opinions with the hashtag #rstats, and you get a ton of really nice, supportive answers back, and a huge amount of support on GitHub or Stack Overflow.
It seems like the R community has one mindset: let’s create useful, helpful packages that are actually going to make a difference to people, and help people contribute to those as well. I don’t know if Python has a similar feel; I’m not embedded in that community.
It’s so interesting, because with Python, although the language is so popular, the challenge is that it’s so distributed. You have people flying drones with Python, people building web apps with Python, and people doing statistics and machine learning. It’s scattered. That’s one of the challenges, and by its nature that could be one of the reasons Python has more than 10 times as many packages, but that doesn’t mean they’re all of the same quality. We need quality over quantity.
Submitting a package to CRAN, the Comprehensive R Archive Network, is not a difficult process at all if you write your package well. But a package submitted to CRAN has to meet certain criteria: the documentation has to be of a certain quality, and the script files have to be laid out and documented in a certain way. So the whole CRAN submission process selects for good-quality packages.
And then, quite frankly, if there are poor packages, people won’t use them, and people are quite happy to submit pull requests and help contribute to each other’s code, which is quite nice. There’s also an additional repository of R packages called Bioconductor, held separately from CRAN, which is specific to, well, it tends to be bioinformatics-type packages, so things for genome sequencing data and that sort of thing, and they have a similar process: your package has to meet Bioconductor’s standards.
Looking at all this, with both R and Python, one of the common threads we’re seeing as developers is that in 2020, open source is making a huge comeback.
We’re seeing the democratization of machine learning and AI systems. One of the core packages you feature in your book is mlr, and mlr is, you could say, the answer to scikit-learn: it’s R’s way of going all in and saying, we can do it just as well as you can. So can you tell our listeners a little about why you think R is democratizing data science and how mlr is part of that process?
So the people who are asking the really important questions, whether to do with business or science or health or whatever, the people who know how to ask those important questions, are the ones who should be able to harness and implement statistics, data science, and machine learning to get the answers.
I don’t think that machine learning should be the purview only of mathematicians and computer scientists. I know that people get scared by this, and there is a gatekeeping argument to be made: you can do a lot of harm if you employ these techniques and don’t do it in a way that is ethical, or don’t validate your results properly.
But I really strongly think that as long as you teach people how to do things properly, so that they have enough of an understanding of how the techniques work, what they do, and what they don’t do, then, absolutely, we can democratize machine learning and make it so that researchers can, say, identify the patients at higher risk of contracting COVID-19.
My book is not going to train the next generation of machine learning researchers. People are not going to be designing self-driving cars or swapping Obama’s face onto the body of another actor, that sort of thing, after reading it.
But we can absolutely teach people to be able to use these techniques, to extract the answers or make the predictions that they’re looking for in their field of expertise. And I spent a lot of time in the book showing people how to properly validate their models. And we do it again and again, and that is really important.
The mlr package, which, funnily enough, stands for Machine Learning in R, kind of changed my world when I discovered it. It’s phenomenal. As you said, it is kind of R’s answer to scikit-learn, along with the caret package. It provides a unified interface not only to a huge number of actual machine learning algorithms, but also to processes and functions like missing-value imputation, hyperparameter tuning, and validation techniques.
When you’re setting up a project or an analysis with mlr, the basic run-through is that you create a task. The task is simply a definition of the dataset you’re working with. If you’re performing a supervised machine learning task, you also define the target: you tell it which variable or variables you’re hoping to predict from the data. So you define your task.
The second step is that you define your learner, which is simply a definition of what algorithm you’re going to use to try to learn the patterns in your data, along with any options or hyperparameters you want to supply. mlr has been written so that it already comes with a huge array of well-known machine learning algorithms, from k-nearest neighbors to XGBoost, and you can interface deep learning models with it too.
But the way it’s been written also means it’s meant to be very easy for anyone to extend. If you have a function that defines the way a machine learning algorithm works, you can implement your own, and you can also implement your own performance metrics and other components as well.
So it’s extremely extendable. Once you’ve defined your task and your learner, you simply combine the two to train your model. Those are the three steps present in any kind of mlr workflow. It sounds a little cumbersome, but it’s really useful, because if you define a single machine learning task, say I supply my data and a categorical target variable stating whether somebody was infected by COVID-19, for example, you can then, with that one task definition, benchmark a huge number of different algorithms against that same task.
Or alternatively, if you define a learner, you can benchmark that one learner against a large variety of different tasks. So it allows you to set up machine learning experiments to test how different algorithms perform in different settings. That’s fine, that’s easy; you can do that quite easily with other packages like caret, or manually.
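The task, learner, train workflow described above, plus benchmarking several learners against one task, can be sketched like this in R. This is illustrative only: the built-in iris data and the particular learners (`classif.rpart`, `classif.lda`) are stand-ins, not examples from the episode or the book.

```r
# Illustrative only: the task -> learner -> train workflow in mlr.
library(mlr)

# 1. Define the task: the data plus the target variable to predict
task <- makeClassifTask(data = iris, target = "Species")

# 2. Define the learner: which algorithm (plus any hyperparameters) to use
learner <- makeLearner("classif.rpart")   # a decision tree as a stand-in

# 3. Combine the two to train the model
model <- train(learner, task)

# The same task definition can be benchmarked against several learners at
# once; each learner gets the same resampling folds for a fair comparison.
learners <- list(makeLearner("classif.rpart"), makeLearner("classif.lda"))
bench <- benchmark(learners, task, makeResampleDesc("CV", iters = 5))
```

Swapping in a different algorithm only means changing the string passed to `makeLearner`; the rest of the workflow stays identical, which is the point of the unified interface.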
Where mlr particularly shines is that it makes it extremely simple to validate your models. Whether you’re performing holdout validation or doing more complicated things like nested k-fold cross-validation, it’s extremely simple to set up. You could set up nested cross-validation with 10 inner folds and 3 outer folds, or whatever you want, and it then allows you to very easily incorporate any data-dependent preprocessing steps inside your validation.
One of the mistakes a lot of people make when they start training complex machine learning models is that they do things like missing-value imputation, maybe hyperparameter tuning, maybe feature selection, and they don’t include those steps inside the model validation. So when they finally validate their model, they get an inflated sense of how well it is going to perform. mlr allows you to create wrapper functions that let you include as many preprocessing steps as you want inside the validation.
In the simple case, you split your data into a simple train/test validation split, and it will perform all of those processes for you during training and then test and validate your data. If you’re doing something more complicated, like nested cross-validation, it’ll ensure, for example, that if you’re tuning a hyperparameter, all the different hyperparameter values you test will be given the same test sets.
So it makes these things very easy. You can also parallelize your code, with the help of the parallelMap package in R, so if you’re training models that are quite intense, mlr works very nicely with parallelization. That’s a surprisingly hard word to say. And I’m a very lazy person; I like to achieve as much as I can in my data science projects with as little code as possible, and mlr helps me achieve that, because you can do some extremely complicated validation and preprocessing with very small amounts of code.
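The wrapper idea discussed above, tuning a hyperparameter inside the validation loop so the performance estimate isn’t inflated, might look like this. Illustrative only: the dataset, the k-nearest-neighbours learner, and the fold counts are placeholder choices.

```r
# Illustrative only: nested cross-validation in mlr, with hyperparameter
# tuning wrapped *inside* the validation.
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")

# Inner resampling: used only to choose the hyperparameter
inner <- makeResampleDesc("CV", iters = 10)

# Tune k for k-nearest neighbours over a simple grid
ps   <- makeParamSet(makeDiscreteParam("k", values = 1:10))
ctrl <- makeTuneControlGrid()
wrapped <- makeTuneWrapper("classif.knn", resampling = inner,
                           par.set = ps, control = ctrl)

# Outer resampling: gives an unbiased estimate of the whole procedure,
# tuning included
outer <- makeResampleDesc("CV", iters = 3)
res <- resample(wrapped, task, resampling = outer)
```

Because the tuning happens inside each outer fold, the outer performance estimate reflects the full modeling procedure, not just a single lucky hyperparameter choice.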
I’m a lazy learner as well; I enjoy preparing and planning. For those listeners who are data scientists, when I say “lazy learner” you might know what I’m talking about, but for the rest of us, it’s all about planning, preparation, and getting those computational graphs set up.
It sounds like mlr definitely is the answer to scikit-learn, but I wanted to briefly dive into those other packages you mentioned, and that I’ve mentioned, that are still around. If someone’s looking to learn R, why would they want to consider picking up caret? What’s your take on caret?
Caret is excellent. I have not spent as much time working with caret as I have with mlr, but it is very nice. Caret has functions you can use to split your data into train/test validation sets, which is quite nice. You can use it to interface with a huge number of machine learning models, and you can also extend it to include other packages.
It also lets you perform data preprocessing steps like missing-value imputation, and you can include those inside your validation. But if you’re doing some very complicated validation, I find it a bit more cumbersome to use.
I also find setting up ensembling much easier in mlr. It’s funny: when you look at tutorials for bagging or boosting, people always talk about decision trees, because they’ve had a lot of success there. But of course, you can use bagging and boosting with algorithms other than decision trees, and with mlr you can set up ensembles of whatever machine learning model you want, and even combine different algorithms within the same ensemble.
This isn’t quite as easy inside caret. mlr also has a nice system for plotting your tuning procedures: if you’re performing hyperparameter tuning, you simply pass the tuning results to its plotting functions and it plots the tuning data for you. But both packages are useful.
MLR has become more popular recently. Caret has been the mainstay; it’s been there for a long time, and it’s very good. There’s nothing that you can do in MLR that you can’t do using caret. So it comes back again to the central question: do I need to use one or the other?
If I were to compare ease of use and readability, MLR would win, for me anyway. If people are used to writing in caret, give MLR a go and see whether you prefer its interface. I also find benchmarking in MLR a lot easier and nicer. You simply give it the task, give it a list of learners, and run the function, and it benchmarks all of those against the same task. Importantly, it provides each with the same training-and-test partition as well, so you get a nice, accurate estimate of how each model has performed.
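The benchmarking workflow described here can be sketched as follows (again using iris for illustration; `benchmark` and the `cv5` resampling description are part of the mlr package, but verify names against the mlr documentation):

```r
library(mlr)

task <- makeClassifTask(data = iris, target = "Species")

# Candidate learners to compare
learners <- list(makeLearner("classif.rpart"),
                 makeLearner("classif.kknn"),
                 makeLearner("classif.lda"))

# Every learner gets identical 5-fold cross-validation splits,
# so their performance estimates are directly comparable
bench <- benchmark(learners, task, resamplings = cv5)
bench
```

Printing the benchmark result summarizes each learner’s performance on the shared folds side by side.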
So I’m going to make a bet here for the first time on HumAIn. What I’m hearing is similar to Python, where scikit-learn and PyTorch have just taken over all machine learning, from research to implementation. We’re seeing MLR rise as an awesome package to basically be that one-stop shop. Caret will still be around, but there’s one final package I wanted to hear your take on today, as a newcomer to this space.
And I know it wasn’t featured much in your book, but a lot of people are talking about it in the R community. What do you think about tidymodels?
It’s very nice. It’s actually a set of packages that come from the tidyverse team. In a similar way to how MLR is trying to create a uniform interface to machine learning, tidymodels is a set of packages trying to create a unified approach to modeling in general, which includes, and is probably most widely used for, linear modeling.
So it creates a unified interface whether people are using the lm function for linear modeling in R, or whether people are using the brms package for Bayesian modeling.
I don’t really talk about it much in my book. The reason is that the way MLR trains models is a little bit different: if you want the actual R model objects, you need to extract them from your final MLR object first. But if people are training models directly, the parsnip package, which is part of tidymodels, gives you a unified interface to modeling.
So no matter what kind of linear model you’re fitting, for example, you don’t have to memorize the different functions and arguments you might need for a regularized linear model, compared to whatever package or function you might use to fit a non-regularized one.
So parsnip is a package that allows you to keep the syntax the same. The dials package, again part of tidymodels, is designed for tuning parameters. With parsnip, you can use dials to tune hyperparameters if you’re searching a parameter space, and the model output comes back in a tidy format.
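The “one syntax, many engines” idea can be sketched like this (the formula and the mtcars data are my illustration; `linear_reg` and `set_engine` are parsnip functions, and the glmnet example assumes that package is installed):

```r
library(parsnip)

# A plain linear model, fitted via lm under the hood...
linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

# ...and a regularized (lasso) linear model via glmnet,
# with exactly the same interface
linear_reg(penalty = 0.1, mixture = 1) |>
  set_engine("glmnet") |>
  fit(mpg ~ wt + hp, data = mtcars)
```

Swapping engines changes which package does the fitting, but the specification and fitting syntax stay the same.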
If people are used to the lm function in R, you’ll know that if you call your model object, you get a lot of information printed out, but it’s not very tidy. It’s in a strange layout that was deemed visually or aesthetically pleasing when the function was written, but is now not very convenient to extract particular things from.
So for example, if you want to extract particular coefficients or confidence intervals from your model output, it’s not difficult, but it’s not as nice. And if you’re looking at output from models from multiple different packages to extract the same information, you might need different syntax for each. When you train your models using tidymodels, you get tidy output, with your model output arranged into rows and columns, which is very easy to work with and extract information from.
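The broom package, part of the tidymodels family, is what produces this rows-and-columns output (a small sketch; `tidy` and `glance` are broom functions, and the model is illustrative):

```r
library(broom)

# An ordinary linear model whose default print-out is hard to parse
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per coefficient, with confidence intervals as columns
tidy(fit, conf.int = TRUE)

# A one-row data frame of model-level summaries (R-squared, AIC, ...)
glance(fit)
```

Because the result is an ordinary data frame, the same filtering and selection code works regardless of which modeling package produced the fit.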
Excellent. We’ve talked a lot about different R packages today. You are a new author, Hefin Rhys, and your book is coming out this month, on March 15th. It’s focused on machine learning with R, the tidyverse, and MLR.
Well, at least it’s not, what is it, R for mortals? But we’re all mortals here. Hopefully we will all do well in the fight against coronavirus. But gearing things back to your book: what was one of the most exciting or enjoyable things about publishing your book and becoming an author?
One of the things was that people actually started asking me questions about content in the book, because that’s when I finally realized that people were reading it.
Manning does this thing called the Manning Early Access Program, so you may see “MEAP” there, where basically the book has been available to buy as an ebook for a few months already, and now the entire book is available. As part of that, they have a book forum where people can read the book, highlight mistakes that I’ve made, even before the book has been finally proofread, which was a bit strange, and ask questions.
So that was particularly exciting because people were asking questions, which was nice because they wanted my opinion and understanding of things. And it also showed me that people had actually bought this book and it’s actually a real thing, which is phenomenal.
The other thing was the process that Manning goes through to produce a book, which is extraordinary. It’s extremely robust. I wrote the book; it was read by my editor, a technical editor, and several people who were experts in particular areas. Then, once they deemed that the book, or certain chapters, were in good enough condition, it got sent off to around 18 reviewers to read and give me their feedback. And that was absolutely terrifying, because you put this content out there for people to read, and then suddenly, of course, you’re open to scrutiny and criticism.
But it was very nice because when the reviewers came back, they came up with some very helpful comments, and the feedback was overwhelmingly positive, which was very nice. Also, I can’t wait to have an actual physical paper book in my hand, because at present, of course, it’s just an ebook. When I can finally hold the book in my hand, that will make it finally real, and I can’t promise that I won’t cry. It’s been a bit of a roller coaster writing it.
It’s taken me quite a long time, but I really do think that it’s an excellent book. That sounds very braggy of me, and I don’t mean it to be, because although I wrote the content, a huge number of people other than me have made the book very good. So I do think that people will learn a lot and get a lot from it. And if you don’t, then contact me and I’ll apologize. Even if you’re not new to machine learning, if you’re interested in the MLR package, you’ll still get some benefit from it.
That’s right. Even if people are not new to R, they’ll learn a lot about R. See what I did there? I love the R programming language; it’s making a huge comeback this year in 2020. Hefin Rhys, thank you so much for sharing everything that you’re doing with the tidyverse, MLR, and the whole tidy ecosystem in your book, coming out March 15th in physical, paper copies. Thank you for being with us on the podcast.
Thank you very much for having me. Thank you.
Thank you for listening to this episode of the HumAIn podcast. What do you think? Did this show measure up to your thoughts on artificial intelligence, data science, future of work and developer education?
Listeners, I want to hear from you so that I can offer you the most relevant and educational content on the market. You can reach me directly at humainpodcast.com/contact. Remember to share this episode with a friend, subscribe and leave a review on your preferred podcasting app, and tune into more episodes of HumAIn.