**Topic:** With all the hype about deep learning and “AI”, it is not
publicized well enough that for the structured/tabular data widely encountered in
business applications it is actually another machine learning algorithm, the
gradient boosting machine/gradient boosted decision trees (GBM/GBDT), that
most often achieves the highest accuracy in supervised learning/prediction
tasks. In this talk we’ll provide plenty of evidence about the vast
superiority of GBMs over deep learning on tabular/business data. Then, we
will present some of the major open source GBM implementations such as
xgboost, h2o, lightgbm and catboost (all of them available from R and
Python), and finally, we will compare their main performance characteristics:
training speed, memory footprint, scaling to multiple CPU cores, GPU
implementations etc. While deep learning is certainly the best algorithm
available for computer vision (and it has also shown some success in a few
other rather specialized domains), in most business applications, where the
data is most often of a tabular structure, gradient boosted decision trees
are vastly superior to deep learning neural networks and should definitely be
the algorithm of choice.

**Bio:** Dr. Szilard Pafka is Chief Scientist at Epoch. Szilard studied Physics
in the 90s and obtained a PhD by using statistical
methods to analyze the risk of financial portfolios. He worked in finance,
then in 2006 he moved to become the Chief Scientist of a tech company in
Santa Monica, California doing everything data (analysis, modeling, data
visualization, machine learning, data infrastructure etc.). He was the
founder/organizer of several meetups in the Los Angeles area (R, data science
etc.) and the data science community website datascience.la for more than a
decade until he relocated to Texas in 2021. He is the author of a well-known
machine learning benchmark on GitHub (1000+ stars), a frequent speaker at
conferences (keynote/invited at KDD, R-finance, Crunch, eRum and contributed
at useR!, PAW, EARL, H2O World, Data Science Pop-up, Dataworks Summit etc.),
and he has developed and taught graduate data science and machine learning
courses as a visiting professor at two universities (UCLA in California and
CEU in Europe). You can follow him on LinkedIn,
Twitter or GitHub.

The lecture took place Friday, October 28, at 1pm Central. The presentation slides are available, and a video recording is available (to anybody with a valid University of Illinois ‘Netid’). An unrestricted YouTube link is also available.

**Topic:** Machine learning practice, often called data science, emphasizes empirical tuning of
predictive models. When practitioners run into common problems, they propose and promote fixes
somewhat different from the statistical canon. I’ll discuss two issues where data science practice
differs from statistical inference: collinear variables and building classifiers for unbalanced
classes. For collinear variables the data science practice is often “regularize and ignore”, which I
will define, and I will explain why this fire-and-forget procedure seems to work. This lets us start to
explore the consequences of using prediction quality as an exclusive model quality metric. For
unbalanced classes I argue that the result is the opposite: ignoring the internal probabilistic
structure of the problem leads to unnecessarily clumsy workarounds. The goal is to show how to
appreciate data science as street-fighting statistics.
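
As a rough illustration of the collinearity point, here is a minimal sketch of one common reading of “regularize and ignore”: add an L2 (ridge) penalty and leave the collinear columns in place. This is an assumption about what the phrase covers, not the speaker's own definition or code. The data is synthetic, and the helper `ridge_fit` and all variable names are made up for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression without intercept.

    Solves (X'X + lam * I) w = X'y; lam = 0 gives ordinary least squares.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (illustrative only): x2 is almost an exact copy of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)

w_ols = ridge_fit(X, y, lam=0.0)    # unregularized: ill-conditioned problem
w_ridge = ridge_fit(X, y, lam=1.0)  # small L2 penalty

print("OLS coefficients:  ", w_ols)
print("ridge coefficients:", w_ridge)

# Both fits give nearly the same predictions on this data.
print("mean |prediction difference|:",
      np.mean(np.abs(X @ w_ols - X @ w_ridge)))
```

In this setup the unregularized coefficients are poorly determined because the two columns are nearly identical, while the penalized fit splits the effect roughly evenly between them. The predictions barely change, which is why the procedure looks harmless when prediction quality is the only metric.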

**Bio:** Dr. John Mount is a Principal Consultant at Win Vector LLC. John has a Ph.D. in computer science
from Carnegie Mellon University, using probabilistic methods to prove convergence rates of Markov
chains in optimization and sampling applications. He worked on structural diversity of molecules
for biotech applications, and wrote and executed algorithmic trading strategies for Banc of America
Securities (a division of Bank of America). He is now concentrating on data science, machine
learning, AI and analytics consulting and teaching. His most recent teaching product is a two-week
private immersion course on data science for engineers. He is the co-author of the book Practical
Data Science with R, published by Manning and now in its 2nd edition. He is the author of a number of
packages for data science in both R and Python. You can follow him on LinkedIn
(https://www.linkedin.com/in/johnamount/), Twitter (https://twitter.com/winvectorllc), or GitHub
(https://github.com/WinVector).

The lecture took place Wednesday, November 9, at 5pm Central. The presentation slides are available, as is the supporting GitHub repo. A video recording is available (to anybody with a valid University of Illinois ‘Netid’). An unrestricted YouTube link is also available.