09:15 |
|
Keynote |
Joe Cheng |
Shiny's Holy Grail: Interactivity with reproducibility |
Vincent Guyader |
Concorde 1+2 |
|
Since its introduction in 2012, Shiny has become a mainstay of the R ecosystem, providing a solid foundation for building interactive artifacts of all kinds. But that interactivity has always come at a significant cost to reproducibility, as actions performed in a Shiny app are not easily captured for later analysis and replication. In 2016, Shiny gained a "bookmarkable state" feature that makes it possible to snapshot and restore application state via URL. But this feature, though useful, doesn't completely solve the reproducibility problem, as the actual program logic is still locked behind a user interface.
In this talk, I'll discuss some of the approaches that app authors have taken to achieve these ends, along with some surprising and exciting approaches that have recently emerged. These new approaches usefully decrease the implementation effort and code duplication, and may eventually become essential tools for those who wish to combine interactivity with reproducibility. |
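For context, the "bookmarkable state" feature mentioned above is enabled roughly as follows (a minimal sketch based on Shiny's documented API; the app itself is illustrative):

```r
library(shiny)

ui <- function(request) {          # bookmarkable UIs are functions of `request`
  fluidPage(
    sliderInput("n", "Observations", 1, 100, 50),
    plotOutput("hist"),
    bookmarkButton()               # generates a URL encoding the current inputs
  )
}

server <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server, enableBookmarking = "url")
```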
10:25 |
|
Workflow & development |
Brent Thorne
|
Transitioning between various RMarkdown packages for workflow optimization in academic research; a graduate student's perspective. |
Henrik Bengtsson |
Concorde 1+2 |
|
Reproducible documentation and the use of RMarkdown are becoming more prevalent in the academic world. Packages built on RMarkdown such as posterdown, xaringan, pagedown, and rticles allow most aspects of a typical research project to be produced; however, newcomers to RMarkdown can be put off or discouraged by the inconsistencies in how each type of document needs to be formatted in the '.Rmd' file. This talk will provide insight into the benefits and drawbacks of fully customizable RMarkdown packages, with emphasis on (1) the importance of a timely workflow for the short duration of an average graduate student program; (2) introducing students or supervisors who are new to RMarkdown; and (3) showing the importance of reproducibility at every stage of a good research project. Concepts from this presentation are intended to spark conversation and show the importance of reproducible document generation in all stages of a research project. |
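As a sketch of the kind of format switching discussed here, each of these packages ships an R Markdown template that can be scaffolded from R. The template names below are assumptions drawn from the packages' documentation and may differ between versions:

```r
library(rmarkdown)

# a conference poster (posterdown), slides (xaringan) and a journal article (rticles)
draft("poster.Rmd", template = "posterdown_html", package = "posterdown", edit = FALSE)
draft("slides.Rmd", template = "xaringan",        package = "xaringan",   edit = FALSE)
draft("paper.Rmd",  template = "jss_article",     package = "rticles",    edit = FALSE)

render("poster.Rmd")   # the same render() call builds each document type
```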
10:30 |
|
Workflow & development |
Paul Stevenson
|
An Approach to Project Workflow for Professional Biostatistical Services |
Henrik Bengtsson |
Concorde 1+2 |
|
Large research groups commonly employ a biostatistician to work across their portfolio of research projects; however, this is not feasible for many research-active clinicians and "low-profile/establishing" research groups, who often struggle to access biostatistical support. Our group offers "low-barrier" initial consultations for analysis, data management and database development on a "per-project" basis, which is attractive to the local health research community as it makes biostatistical expertise affordable and easily accessible. To facilitate our workflow, we have developed and refined a template project (skeleton) in R (using the "ProjectTemplate" package) that, along with version control systems, conforms to the principles of reproducible research. We have also developed R Markdown templates to produce documents following our Institute's style guide. This approach allows us to streamline our workflow, expeditiously initiate projects and produce professional-looking reports in multiple formats directly from the analysis package, without wasting time on the non-analytical aspects of our projects; the same approach works for both simple and large-scale projects. |
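A minimal sketch of initialising such a skeleton with the ProjectTemplate package (the project name is hypothetical; the group's own template adds further customisation on top of this):

```r
library(ProjectTemplate)

create.project("client-project-001")   # scaffolds data/, munge/, reports/, config/, ...
setwd("client-project-001")
load.project()                         # reads config, loads data and runs munging scripts
```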
10:35 |
|
Workflow & development |
Ildiko Czeller
|
ropsec: a package for easing operations security for the R user |
Henrik Bengtsson |
Concorde 1+2 |
|
Applying security best practices is essential not only for developers or sensitive data storage but also for the everyday R user installing R packages, contributing to open source, or working with APIs or remote servers. However, keeping up to date with security best practices and applying them meticulously requires significant effort and is difficult without expert knowledge. The goal of the R package ropsec (github.com/ropenscilabs/ropsec) is to bring some of these best practices closer to all R users and to enable them to add a few more layers of security to their personal workstation and shared work. In this talk I will focus on signing commits: why you should do it and how ropsec helps you do it the right way with the lowest possible risk of making a mess of your settings. I will also highlight how you can reliably test an R package whose core functionality is changing settings outside your R project. Work on ropsec started at the 2018 rOpenSci unconf (unconf18.ropensci.org) and the package has continuously improved since then. ropsec leverages gpg (on CRAN), which provides low-level functions for signing commits; the value added comes from the collection of high-level functions, the thorough documentation and the intuitive workflow that help R users solve end-to-end use cases. Its aim is to mitigate the risk of the user doing something they do not intend to do due to the complexity of the low-level operations. |
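For orientation, the kind of commit-signing setup that ropsec streamlines looks roughly like this when done by hand with the gpg package and git. The identity is hypothetical, and it is assumed that gpg_keygen() returns the new key's id; ropsec's own helpers wrap these steps with additional safety checks:

```r
library(gpg)

key_id <- gpg_keygen(name = "Jane Doe", email = "jane@example.org")  # hypothetical identity
gpg_list_keys()                                                      # verify the key exists

# point git at the key and sign every commit from now on
system2("git", c("config", "--global", "user.signingkey", key_id))
system2("git", c("config", "--global", "commit.gpgsign", "true"))
```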
10:40 |
|
Workflow & development |
Nicoletta Farabullini
|
compareWith - user-friendly diff viewing and VCS interaction |
Henrik Bengtsson |
Concorde 1+2 |
|
Version control systems provide an important environment for controlled code and software development. In the case of R and RStudio, however, the integration of version control tools is still significantly behind what is desirable. This has become a common concern for the whole community, especially with the rise of GitHub and git for open-source R package development, where issues are tackled through separate branches and contributors are more numerous and heterogeneous.
Command-line git interaction can be an additional barrier, and so individuals and organizations have sought different ways to deal with the shortcomings. In this talk, we propose a flexible, lightweight combination with Meld (http://meldmerge.org/), an open-source visual diff and merge tool, and demonstrate "compareWith" (https://github.com/miraisolutions/compareWith), an R package that allows users to interact with Meld from within RStudio.
compareWith provides user-friendly addins that enable and improve tasks that are otherwise difficult or impossible without a custom extension. Examples include: i) comparing differences prior to commit, for single active files or the whole project; ii) resolving and merging conflicts via three-way comparison; iii) comparing two distinct files with each other. Even simple tasks benefit from the improved diff viewer compared to the one built into RStudio. |
10:45 |
|
Workflow & development |
Hannah Frick
|
goodpractice - A Tool for Good Package Development |
Henrik Bengtsson |
Concorde 1+2 |
|
Building an R package is a great way of encapsulating code, documentation and data, in a single testable and easily distributable unit. Whether you work by yourself or with others, the goal is always to keep your code easily maintainable and bug-free. R CMD check offers a set of checks on the source code of a package to ensure a quality standard required for packages on CRAN. However, it does not cover other aspects of writing good quality software such as code complexity (1) and does not require testing. The goodpractice package leverages several R packages addressing these aspects and, in one place, calculates code coverage (via covr) and cyclomatic complexity (via cyclocomp), runs linters (via lintr), includes all checks from R CMD check (via rcmdcheck) and gives further advice on good practices for R packages, e.g., to include a URL for a bug tracker. The package currently contains 230 checks to help package developers write high quality packages to a common standard. It is both configurable and extensible, so you can use it with your custom set of checks.
(1) TJ McCabe (1976) A Complexity Measure. IEEE Transactions on Software Engineering (SE-2:4). |
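A minimal sketch of running these checks on a package source directory (the path is hypothetical):

```r
library(goodpractice)

g <- gp("path/to/mypackage")   # runs covr, cyclocomp, lintr, rcmdcheck and more
g                              # prints the advice, e.g. to add a URL for a bug tracker
results(g)                     # per-check results as a data frame
```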
10:50 |
|
Workflow & development |
Jakob Richter
|
rt - R Tools for the Command Line |
Henrik Bengtsson |
Concorde 1+2 |
|
rt - R tools for the command line - is an R package containing a collection of CLI programs to simplify R package development, daily routines and managing the R user library. It runs on Unix-alikes, macOS and Windows. rt packs many basic operations into handy command line instructions, effectively isolating tasks in a separate R process so that they do not stand in the way of your coding experience. Developing R packages is often a tedious matter involving many repetitive tasks, mainly testing and checking. Reducing the effort for those tasks adds up to a more efficient workflow and more available time for coding. rt allows you to test, check, build and spellcheck packages, and to upload them to win-builder and R-hub from the command line. Simple routines like installing packages from CRAN or GitHub, knitting documents, starting Shiny apps and updating the package library can be run directly from the shell. For users who maintain R installations on multiple machines, a dotfile configuration allows them to keep the package libraries synchronous. Project page: https://github.com/rdatsci/rt |
10:25 |
|
Text mining |
Dmytro Perepolkin
|
{polite} - web etiquette for R users |
Riva Quiroga |
Cassiopée |
|
Data is everywhere, but that does not mean it is freely available. What are the best practices and acceptable norms for accessing data on the web? How does one know when it is OK to scrape the content of a website, and how can it be done in such a way that it does not create problems for the data owner and/or other users? This lightning talk will introduce the {polite} package, a collection of functions for safe and responsible web scraping. The three pillars of {polite} are seeking permission, taking slowly and never asking twice. |
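The three pillars map onto a short workflow along these lines (the target URL and user agent are illustrative):

```r
library(polite)

session <- bow("https://www.example.com/articles",
               user_agent = "useR demo")   # seek permission: reads robots.txt, sets a delay
session                                    # prints the crawl delay and whether scraping is allowed

page <- scrape(session)                    # take slowly + never ask twice:
                                           # rate-limited and memoised retrieval
```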
10:30 |
|
Text mining |
Samuel Borms
|
The R Package sentometrics to Compute, Aggregate and Predict with Textual Sentiment |
Riva Quiroga |
Cassiopée |
|
We provide a hands-on introduction to optimized textual sentiment indexation using the R package sentometrics. Driven by the need to unlock the potential of textual data, sentiment analysis is increasingly used to capture its information value. The sentometrics package implements an intuitive framework to efficiently compute sentiment scores of numerous texts, to aggregate the scores into multiple time series, and to use these time series to predict other variables. The workflow of the package is illustrated with a built-in corpus of news articles from The Wall Street Journal and The Washington Post, creating a selection of interesting aggregated text-based indices and using these to forecast expected stock market volatility. |
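A rough sketch of that workflow, based on the package's documented functions; the exact argument names and aggregation options are assumptions and may differ between package versions:

```r
library(sentometrics)

data("usnews", package = "sentometrics")                  # built-in WSJ/WaPo corpus
corpus   <- sento_corpus(usnews)
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])

ctr      <- ctr_agg(howWithin = "counts", howDocs = "proportional",
                    howTime = "linear", by = "month", lag = 6)
measures <- sento_measures(corpus, lexicons, ctr)          # aggregated sentiment time series
plot(measures)
```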
10:35 |
|
Text mining |
Cécile Sauder, Jean Delmotte
|
BibliographeR : a set of tools to help your bibliographic research |
Riva Quiroga |
Cassiopée |
|
The number of scientific articles is constantly increasing, and it is sometimes impossible to read all the articles in certain areas. Among this great diversity of articles, some may be more interesting than others, and it is difficult to select which articles are essential in a field. The contemporary way to judge the scientific quality of an article is to use the impact factor or the number of citations. However, these parameters may lead to overlooking certain articles that are not widely cited but are very innovative. It is therefore essential to ask what makes an article fundamental in a field. Using the "fulltext" package in our Shiny web application, we show how the analysis of a bibliography using a network is a good way to visualize the state of the art in a field. We searched for different parameters to judge scientific quality using data science approaches. Recent research has shown that the work of small research teams can lead to scientific innovations. In this sense, the analysis of scientific articles by global techniques could play an important role in the discovery of these advances. |
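One building block of such an app, sketched with the fulltext package (the query, source and result fields are illustrative assumptions):

```r
library(fulltext)

res <- ft_search(query = "citation network analysis", from = "plos", limit = 50)
res$plos$data                          # metadata (DOIs, titles) for the matching articles

articles <- ft_get(res$plos$data$id)   # retrieve full text by DOI for network building
```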
10:40 |
|
Text mining |
Erwan Le Pennec
|
ggwordcloud: a word cloud geometry for ggplot2 |
Riva Quiroga |
Cassiopée |
|
Word clouds provide a nice mechanism to visualize a list of weighted words. R already has two dedicated packages, wordcloud and wordcloud2, which produce respectively a base plot and an HTML widget. ggwordcloud was introduced last fall to propose an alternative for the ggplot2 ecosystem. It consists mainly of a new geometry, geom_text_wordcloud, which places words on the ggplot2 canvas using an algorithm very similar to the one used in wordcloud. It provides some extensions: better word scaling, the use of masks as in wordcloud2, and compatibility with the ggplot2 facet system. The code has been inspired by ggrepel and relies on Rcpp so that the rendering is fast. In the lightning talk, we will present the package and some of its uses. For more details, see https://lepennec.github.io/ggwordcloud/ |
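A minimal example (the data are illustrative):

```r
library(ggplot2)
library(ggwordcloud)

words <- data.frame(
  word = c("R", "ggplot2", "wordcloud", "useR", "Toulouse"),
  freq = c(50, 30, 20, 15, 10)
)

ggplot(words, aes(label = word, size = freq)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()
```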
10:45 |
|
Text mining |
Chung-Hong Chan
|
Die Nutella oder Das Nutella? Grammatical Gender Prediction of German Nouns |
Riva Quiroga |
Cassiopée |
|
One of the big challenges of learning German is determining the grammatical gender of German nouns (Genus / grammatisches Geschlecht, e.g. der Löffel, die Gabel, das Messer). To the casual learner, the rules seem pretty arbitrary. In this talk, I am going to show how I created a quite (or not quite) accurate model to predict the grammatical gender of German nouns using the R package keras. During the development process, I encountered some interesting problems concerning the German language and machine learning in general. |
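As a generic illustration (not the speaker's actual architecture), a character-level gender classifier in the keras R package could be set up along these lines; `x` and `y` are assumed to be integer-encoded, padded noun spellings and one-hot der/die/das labels:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 60, output_dim = 16) %>%  # ~60 distinct characters assumed
  layer_lstm(units = 32) %>%
  layer_dense(units = 3, activation = "softmax")        # der / die / das

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)

# model %>% fit(x, y, epochs = 10, validation_split = 0.2)
```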
10:50 |
|
Text mining |
Johannes Müller
|
Implementing a Classification and Filtering App for Multilingual Facebook Comments – A Use Case Of Data For Good with R |
Riva Quiroga |
Cassiopée |
|
What is the biggest challenge in data science? Some say it is messy data, others say company politics, but for us at CorrelAid one of the biggest challenges is its unused potential. Larger companies, universities or governmental organizations can afford professional data scientists, but what about civil society or NPOs? Currently, the resources to generate and use data science insights are highly imbalanced. Using R, we show how to leverage this unused potential by applying a cutting-edge multi-language classification approach.
In collaboration with Minor – a German organisation offering legal advisory services for marginalised groups – we present a data-for-good use case. We use NLP techniques such as multi-language word embeddings (word2vec), unsupervised classification (e1071, caret) and topic modeling (stm) to enable Minor volunteers to better allocate their time. We demonstrate the implementation of an interactive Shiny dashboard app for classifying and filtering multilingual Facebook comments, including English, Bulgarian and Arabic. The filter mechanisms first identify and filter out comments in the Facebook groups that Minor monitors. The topic model then allocates comments to the respective volunteer at Minor. As a result, volunteers can spend more time on actually helping their clients on Facebook instead of sifting through unstructured comments to find the relevant cases. |
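One plausible piece of such a pipeline, sketched with the stm package; a data frame `comments` with a `text` column is hypothetical:

```r
library(stm)

processed <- textProcessor(comments$text, metadata = comments)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

fit <- stm(documents = prepped$documents, vocab = prepped$vocab,
           K = 10, data = prepped$meta, init.type = "Spectral")

labelTopics(fit)   # top words per topic, used to route comments to the right volunteer
```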
10:55 |
|
Text mining |
Nolwenn Le Meur
|
queryMed: Linking pharmacological and medical knowledge using semantic Web technologies |
Riva Quiroga |
Cassiopée |
|
Care trajectory analysis from medical administrative information systems data requires integrating multiple medical data sources (hospital care, drug consumption...) stored in various specialized codifications (ICD10, CIP...), more or less detailed and not always intelligible at first glance to epidemiologist researchers. Tools and methods are needed to facilitate the annotation and integration of such codes in order to decipher and compare care trajectories. Because medical data are often codified according to international nomenclatures, they can be linked to knowledge representations from the medical and pharmacological domains. Linked Data initiatives and Semantic Web technologies have led to the spread of knowledge representations through ontologies, thesauri, taxonomies and nomenclatures. However, these standards, technologies and knowledge representations remain difficult for non-computer scientists, such as pharmaco-epidemiologist researchers, to organize and query. The R package queryMed (https://github.com/yannrivault/queryMed) provides user-friendly methods to access biomedical knowledge resources from the Linked Data. It proposes methods for the enrichment of health care data to facilitate care trajectory analysis (e.g., pharmaco-surveillance, pharmaco-vigilance), making the most of ontology and nomenclature properties. |
10:25 |
|
Spatial & time series |
Enrico Spinielli, Tamara Pejovic
|
R in the Air |
Chris Prener |
Caravelle 2 |
|
Aircraft trajectories are becoming increasingly available, both publicly via ADS-B data crowd-sourced by aviation enthusiasts and non-profit organisations such as the OpenSky Network and ADSBexchange, and commercially via platforms such as FlightRadar24 and Flightaware. The trrrj R package supports the import, export and analysis of aircraft trajectories (4D: 3D plus time) in the various phases of a commercial flight. The package also provides support for spatial analysis and for plotting horizontal and vertical profiles. We present a real use case applying trrrj to arrival flights at several European airports, assessing the inefficiencies (time, distance flown, fuel/CO2 emissions) related to holdings. |
10:30 |
|
Spatial & time series |
Piotr Wójcik
|
Measuring inequalities from space. Analysis of satellite raster images with R |
Chris Prener |
Caravelle 2 |
|
Data on night-time light intensity is increasingly used by social science researchers as a proxy for economic development. Calculated from weather satellite recordings, it provides annual data for the whole globe in gridded format, with pixels covering less than one square kilometer. This allows researchers to aggregate these data at the level of subnational units and analyze them together with other socio-economic indicators. Satellite images are freely available as large raster files. The aim of this presentation is to show how to analyze these data step by step in R – starting from importing the data into R, then correctly overlaying a map of a selected area (e.g. a shapefile) on the raster object, limiting the satellite image to the selected spatial extent, and finally aggregating the data for the analyzed territorial units and visualizing the result. In addition, the correlation between night-time light intensity and selected socio-economic indicators (e.g. population, GDP) will be analyzed for world countries, US states and EU regions (NUTS2 and NUTS3). R packages: dplyr, sf, raster, ggplot2, leaflet, WDI, eurostat |
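The step-by-step workflow described above looks roughly like this; the file paths are hypothetical, and depending on package versions the sf object may need converting with as(x, "Spatial") before mask() and extract():

```r
library(raster)
library(sf)

lights  <- raster("nightlights_2013.tif")   # 1. import the satellite raster
regions <- st_read("nuts2_regions.shp")     # 2. read the administrative boundaries

lights_crop <- crop(lights, regions)        # 3. limit the raster to the spatial extent
lights_crop <- mask(lights_crop, as(regions, "Spatial"))

regions$mean_light <- extract(lights_crop, as(regions, "Spatial"),  # 4. aggregate per unit
                              fun = mean, na.rm = TRUE)

plot(regions["mean_light"])                 # 5. visualise the result
```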
10:35 |
|
Spatial & time series |
Florence Carpentier
|
SILand: an R package for estimating the spatial influence of landscape |
Chris Prener |
Caravelle 2 |
|
In ecology and in epidemiology, understanding how landscape structures the spatial distributions of species and populations is crucial to determine management rules for sustainable systems in agriculture and conservation biology. However, identifying landscape effects remains difficult, especially since no easy-to-use tools are available for this type of analysis. Here we present 'SILand', an R package. It is the first user-friendly tool that permits a complete and complex analysis of landscape effects, using a few functions based on a "classic" syntax similar to that of well-known packages (such as stats and lme4). It can be used to: (i) quickly import GIS files into R; (ii) infer the spatial influence of landscape variables and the area of significant influence of each landscape variable; and (iii) produce highly informative visualizations of the results (prediction maps). |
10:40 |
|
Spatial & time series |
Rodelyn Jaksons
|
Spatio-temporal Analysis of Diabrotica Emergence |
Chris Prener |
Caravelle 2 |
|
Diabrotica, more commonly known as the cucumber beetle or corn rootworm, is a beetle that is a major pest, known to cause major economic damage to corn growers. Diabrotica was previously only found in the United States and some parts of South America, but is believed to have been introduced to Europe during the Yugoslav wars. Due to its potentially devastating economic repercussions, it is important to understand the population life cycle of the beetles to ensure adequate controls and measures are in place to minimise impact. Through the use of the Gompertz curve and Bayesian hierarchical models in R, we can model the observed dynamics of the beetles' emergence to infer when emergence starts and how space, time, and climatic factors affect the dynamics. |
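For orientation, the Gompertz curve at the core of the model can be sketched with a simple non-linear least-squares fit (the Bayesian hierarchical version in the talk is considerably richer); a data frame `emergence` with columns `day` and `count` is hypothetical:

```r
# SSgompertz() parameterises the curve as Asym * exp(-b2 * b3^day)
fit <- nls(count ~ SSgompertz(day, Asym, b2, b3), data = emergence)
summary(fit)

plot(count ~ day, data = emergence,
     xlab = "Day of year", ylab = "Cumulative emergence")
lines(emergence$day, fitted(fit), col = "red")
```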
10:45 |
|
Spatial & time series |
Annette Scheffer
|
Navigating spatial data management and analysis in Sustainable Fisheries using a combined R-Python approach |
Chris Prener |
Caravelle 2 |
|
Geospatial data is of increasing importance in resource management, impact assessment and decision-making. However, the variety of requirements and levels of user experience when working with spatial data - from developing data visualisations through graphical interfaces to undertaking analyses with customized coding - necessitates tailored solutions to maximize the accessibility of data and analysis tools for different user groups. The Marine Stewardship Council (MSC) is a non-profit organization providing an internationally recognized sustainable seafood ecolabel and fishery and traceability certification programs. The nature of such an organization means that multiple departments have different requirements and coding proficiency, requiring that the MSC's spatial data management and analysis strategy accommodate a variety of access and analysis tools. Here, we discuss the implementation of spatial data storage, organisation and analysis at the MSC in light of maximizing accessibility, workflow efficiency and reproducibility of results. Our implementation combines open-source analysis tools such as R, Python and QGIS together with PostgreSQL for spatial data management. R and Python scripts combined via the rPython package allow connecting to the spatial database from R or QGIS and automating repeated processes such as specific queries and analyses. |
10:50 |
|
Spatial & time series |
Kim Antunez
|
Dealing with the change of administrative divisions over time |
Chris Prener |
Caravelle 2 |
|
The administrative divisions of countries change over time, making it tricky to combine territorial databases from different dates. I will present two packages which help to solve this problem:
1) COGugaison [1], which provides functions for converting a spatial dataset from one year to what it would be (or would have been) in another year. The package handles cases where territories merge or split.
2) CARTElette [2], which contains geographical layers corresponding to the annual division of French territories and which can be loaded directly into R's popular {sf} format. Thanks to these two packages, it is for example possible to look at the evolution of the number of women in each French department since 1975, taking into account that some of the territories have changed during this period [3]. At this time, these packages only concern France and are therefore only documented in French. By presenting this problem and my current work internationally, I hope to inspire future extensions to other countries and collaborations with international spatial analysts.
[1] https://github.com/antuki/COGugaison
[2] https://github.com/antuki/CARTElette
[3] https://antuki.github.io/slides/180306_RLadies_COGugaison_carto/180306_RLadies_COGugaison_carto.html#51 |
10:55 |
|
Spatial & time series |
Gregor De Cillia
|
persephone, seasonal adjustment with an object-oriented wrapper for RJDemetra |
Chris Prener |
Caravelle 2 |
|
The R package persephone is being developed to enable easy processing during the production of seasonally adjusted estimates. It builds on top of RJDemetra and provides analytical tools, such as interactive plots, to support the seasonal adjustment expert. The package should make it easy to construct personalized dashboards containing selected plots and diagnostics. Furthermore, it will support hierarchical time series and tackle the issue of direct vs. indirect adjustment. |
10:25 |
|
Open science, education & community |
Saras Windecker
|
Open-access software for research: beyond data analysis |
Shelmith Kariuki |
Saint-Exupéry |
|
Amidst growing recognition of irreproducibility in the scientific literature, we face concerns about the credibility of research and the reliability of the decisions it informs. Open science practices, such as writing open-access software, encourage research that is both transparent and repeatable. Although open-source R packages for data analysis are increasingly common, proprietary software and black-box approaches are still the norm for many data collection and processing procedures. In this talk I will briefly present "mixchar", an R package alternative to financially restrictive "point-and-click" methods to estimate carbon in plant litter, and discuss software development for open science in research more generally. |
10:30 |
|
Open science, education & community |
Angela Li
|
Teaching reproducible spatial analysis in R |
Shelmith Kariuki |
Saint-Exupéry |
|
In this talk we will discuss a workshop taught at the Center for Spatial Data Science at the University of Chicago to social scientists and econometricians with little to no background in spatial data or programming. Unlike a conventional spatial statistics or analysis course, the workshop integrated learning to code in R with learning to think spatially. Researchers learned how to explore, manipulate, and visualize spatial data using recently developed spatial packages in R, while at the same time learning habits for project management and reproducible research. This talk discusses pitfalls and success stories from the workshop, along with considerations when putting together a spatial data curriculum in R. It describes how the workshop led to increased programming literacy among researchers, as well as contributions to open-source spatial packages. Finally, it puts forth suggestions for a spatial data curriculum that can be shared openly to teach spatial thinking in R worldwide. |
10:35 |
|
Open science, education & community |
William Chase
|
Use aRt to learn algorithms, math, and R |
Shelmith Kariuki |
Saint-Exupéry |
|
I'm bad at math; algorithms look like a foreign language to me; and until recently, I thought of R as a tool just for statistics. All of that changed when I discovered generative art. Almost overnight I went from being afraid of math to dreaming of the Mandelbrot set and reading papers on Wang tiling algorithms. I desperately wanted to make my own art, but I had just become comfortable with R, and the idea of learning Processing or Javascript was daunting. So I barrelled forward with R, transforming my attitude towards coding and what is possible with R. In this talk I will take useRs along my journey from math-phobe to algorithm evangelizer through my "12 Months of aRt" project (which they can read about on my blog williamrchase.com). I will discuss how the beauty of generative art engaged me more than any math class, and I will inspire useRs to do something fun with R and learn in the process. Along the way, we will learn how R is a fully capable creative coding environment, and how you can leverage a wide breadth of tools such as the tidyverse, spatial libraries, and even Rcpp to turn your creative visions into reality. |
10:40 |
|
Open science, education & community |
Beatriz Milz
|
The evolution and importance of the R-Ladies São Paulo chapter in Brazil |
Shelmith Kariuki |
Saint-Exupéry |
|
R-Ladies is a worldwide organization that promotes gender diversity in the R community. Brazil is a developing country in Latin America that currently has 9 R-Ladies chapters; the São Paulo chapter was created in August 2018. By February 2019, it had almost 300 members, showing impressive growth in such a short period of time. The chapter has already held 7 events (four 3-hour meetups, one datathon and 2 whole-day workshops), and other meetings are being planned. All of the content presented is made available on GitHub, so anyone can access it. The group provides a safe environment for people interested in R, and all of the activities are free to attend. The fact that the activities are free is very relevant in the Brazilian context, since the few R courses available in Portuguese are often expensive, and there is no other active R group in São Paulo. The chapter also collaborates with other projects through lectures and workshops, which are also always open to the community. The increasing popularity of the group shows us how important it is to support it, which can be key to motivating the creation of other chapters in Brazil and increasing the strength of the Brazilian R community. |
10:45 |
|
Open science, education & community |
Binod Jung Bogati
|
Building Active Community at Your Place |
Shelmith Kariuki |
Saint-Exupéry |
|
The strength of R comes from its community. It's easy to get involved as a member of a community (if one exists). However, building a new community, or making a community active, is not easy. Here, I'll share my experience (as a student) of building the first R community in Nepal (now 350+ members). What were the shared struggles of my community? How did I manage to overcome challenges and build a welcoming and active community? Besides that, I'll also share how our community helps students learn R for academics and research. So, come and join me for tips on building an active and engaging community at your place. |
10:50 |
|
Open science, education & community |
Eyitayo Alimi
|
Scaling useR Communities with Engagement and Retention Models |
Shelmith Kariuki |
Saint-Exupéry |
|
In this talk, I intend to unfold the challenges local useR communities are facing, backing up my research with the following data: How many R user groups and R-Ladies chapters have been founded? How many are still in existence? How many couldn't survive in Africa, and how many were revived? These data will be analysed to proffer a lasting solution to these challenges and help nurture existing useR communities as well as new ones. All this will be capped with advice drawn from working with more than 7 growing communities over the last 5 years. |
10:25 |
|
Biostatistics & epidemiology |
Romane Poinsot
|
A Shiny Webapp for nutritional reformulation of food products according to French front-of-pack “Nutri-Score” label. |
Toby Dylan Hocking |
Ariane 1+2 |
|
In France, over half of the processed food consumed on a daily basis is made by the food industry. Dietary behaviors play a key role in the development of chronic diseases. In order to help people make healthier choices, the French government, among other actions, implemented the Nutri-Score label on a voluntary basis. This score, adapted from the Food Standards Agency score, ranges from A (green, best score) to E (red, worst score), taking into account the content of 7 favorable and unfavorable components commonly found in food. We developed a web-based Shiny app to help agribusiness professionals calculate and improve the Nutri-Score of their products. A first feature consists of loading the user's product nutritional composition table and automatically calculating the score for all products. Using ggplot2 and ggiraph, the user visualizes the score distribution across her/his product database, filtering according to her/his own variables. The officer package allows the user to pick the desired graphs for an automatic report. Then, for a product and a target score (from A to E) selected by the user, all combinations of components (expressed as content ranges) complying with the target are generated. Only the combinations closest to the initial nutritional composition are displayed, and the user can also choose which components may vary. In this way, the user gets conceivable solutions to improve the nutritional quality of her/his products while saving valuable time. |
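A sketch of the interactive score-distribution idea with ggplot2 and ggiraph (the data are illustrative, not the app's own code):

```r
library(ggplot2)
library(ggiraph)

products <- data.frame(
  name  = paste("Product", 1:6),
  score = c("A", "B", "B", "C", "D", "E")
)

p <- ggplot(products, aes(x = score, tooltip = name, data_id = name)) +
  geom_bar_interactive(fill = "seagreen")

girafe(ggobj = p)   # interactive htmlwidget, embeddable in the Shiny app
```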
10:30 |
|
Biostatistics & epidemiology |
Fiona Grimm
|
Using Shiny to track winter pressures in the UK National Health Service (NHS) |
Toby Dylan Hocking |
Ariane 1+2 |
|
The NHS in England is under considerable pressure during winter. Within the context of existing funding pressures, demand for hospital care increasingly exceeds the capacity of emergency departments. In recent years, performance targets have consistently been missed at a national level, with potentially worrying consequences for care quality and safety. This trend has also received growing media and political attention. Throughout winter the NHS regularly releases provider-level data on performance indicators, such as A&E waiting times and hospital bed occupancy, which are key to understanding quality of care and to informing future planning efforts. However, partly due to inconvenient formatting of the spreadsheets, it takes considerable analytical skill and effort to routinely produce aggregate metrics, examine trends and assess regional variation. We have developed a Shiny app as an interface for the visualisation and comparison of a range of NHS performance indicators over winter, aimed at the public, the media and NHS analysts (to be released before useR). The app also shows historical context and offers the option to aggregate indicators within local areas. With this we aim to provide a convenient, consistent and consolidated way of tracking NHS winter performance indicators. We also want to use it as a case study and learning resource to promote the use of R within the NHS via the NHS-R community. |
10:35 |
|
Biostatistics & epidemiology |
Thomas Petzoldt
|
antibioticR: An R package to identify resistant populations in environmental bacteria |
Toby Dylan Hocking |
Ariane 1+2 |
|
Antibacterial agents have made modern medicine possible. However, the dramatic increase of resistant and multiresistant bacteria is now recognized as a global challenge for human health. Phenotypic resistance can be measured in growth experiments where bacterial isolates are cultivated under drug exposure. This can be done in liquid media on multiwell plates to identify minimum inhibitory concentrations (MIC), or as diffusion test on an agar dish, where the diameter of the inhibition zone (ZD) is recorded. This is repeated with a large number of strains, because environmental populations are composed of different geno- and phenotypes. The MIC or ZD values form multi-modal distribution mixtures.
Package antibioticR (https://github.com/tpetzoldt/antibioticR) implements methods to separate sub-populations from environmental samples and to estimate distribution parameters and quantiles. It provides:
1. Kernel density smoothing to estimate location parameters and an initial guess of variance,
2. A web-based (shiny) implementation of the ECOFFinder algorithm (Turnidge et al, 2006),
3. Maximum likelihood estimation of multi-modal normal and exponential-normal mixtures.
The package analyzes sensitivity, tolerance and resistance at a sub-acute level to compare populations of different origin. It also contains visualization tools and interactive web applications. |
10:40 |
|
Biostatistics & epidemiology |
Daniela Mariosa
|
MR studies in R: how to use genetic information for identifying modifiable risk factors |
Toby Dylan Hocking |
Ariane 1+2 |
|
Mendelian randomization (MR) is a powerful approach to study causality by using germline genetic variants associated with a risk factor (exposure) as instrumental variables for the risk factor itself. The growing availability of results from large genome-wide association studies, not only for clinical outcomes but also for lifestyle exposures, makes MR analyses based on summary genetic data relevant for many exposure-outcome relationships. Properly conducting MR studies in R requires several steps that involve packages for both estimation and data visualization. We will present how to best exploit R to perform and present two-sample MR analyses, with an example from our work on the role of obesity-related factors in cancer risk. The approach includes the harmonization of the genetic information for the exposure and outcome, the estimation of the causal effect of the exposure on the outcome using different estimators, and a number of complementary analyses for the evaluation of potential pleiotropy, heterogeneity, assumption violations, and bias. The wide range of techniques required to conduct a robust MR analysis is reflected in the use of both widely used and MR-specific packages. |
10:45 |
|
Biostatistics & epidemiology |
Volha Tryputsen
|
Streamlining complex analyses of in-vivo data with INVIVOLDA shiny application |
Toby Dylan Hocking |
Ariane 1+2 |
|
In vivo studies are crucial to the development of novel therapies. In vivo data is important for proof-of-concept validation, FDA applications and clinical trials. Appropriate data analysis and interpretation are essential in providing knowledge about a drug's efficacy and safety within a living organism. With drug discovery science moving forward at an ever-accelerating rate, analysis software is not always capable of offering an appropriate analysis suite. In vivo scientists at Janssen R&D needed a comprehensive analysis tool to conduct appropriate and efficient analyses of in vivo data to ensure the quality and speed of decision-making. The INVIVOLDA Shiny application was developed to fill this gap. INVIVOLDA offers powerful linear mixed-effects modeling for evaluating differences between treatments over time; the Buckley-James method for inference about mean treatment differences when the response is non-linear in the presence of censoring; and survival analysis for the evaluation of time-to-event data. Furthermore, interactive and animated graphics allow users to conduct independent and thorough data explorations. INVIVOLDA streamlines complex statistical analyses of in vivo longitudinal data by utilizing modern graphics, appropriate modelling techniques and report generation, and ensures efficient, traceable and reproducible in vivo research and data-driven decision-making. |
10:50 |
|
Biostatistics & epidemiology |
Aritz Adin
|
A shiny web application for disease mapping. Making easy the fit of spatio-temporal models. |
Toby Dylan Hocking |
Ariane 1+2 |
|
Spatial and spatio-temporal analyses of count data are crucial in epidemiology and other fields to provide accurate estimates of mortality and/or incidence risks and to unveil the underlying spatial and spatio-temporal patterns. However, fitting spatial and spatio-temporal models is not easy for non-expert users. Here, we present the interactive web application SSTCDapp for the analysis of spatial and spatio-temporal mortality (or incidence) data, which is available at https://emi-sstcdapp.unavarra.es/. The web application is designed to perform descriptive analyses in space and time of mortality risks or rates, and to fit an extensive range of fairly complex spatio-temporal models commonly used in disease mapping. The application is built with the R package shiny and relies on the well-founded integrated nested Laplace approximation (INLA) technique for model fitting and inference. Unlike other software used in disease mapping, SSTCDapp provides a user-friendly interface that facilitates the fitting of complex statistical models by non-expert users without the need to install any software on their own computers, since all the analyses and computations are made on a powerful remote server. In addition, a desktop version is also available to run the application locally if needed, which avoids uploading the data to the online application, fully guaranteeing data confidentiality. |
11:30 |
|
Data mining |
Eric Lecoutre
|
Machine Learning with R: do it with a framework |
Sigrid Keydana |
Cassiopée |
|
There is no doubt that R is the Swiss Army knife for modeling activities. The variety of model families and implementations in numerous packages speaks for itself. Interested R users will read the Machine Learning task view and discover (some of) those packages. This talk is not about the models themselves but about frameworks. Some meta-packages indeed do not contain modeling functions at all but act as wrappers around existing assets and provide high-level functionality (cross-validation, hyperparameter grid search, stacking...). The objective is to have a consistent syntax with a limited set of 'verbs', as well as a way to implement a modeling process. We will introduce the reasons why such packages are so interesting for data scientists and present 3 solutions: caret, mlr and SuperLearner. In addition, as a modeling pipeline also requires data preparation, we will also talk about vtreat, recipes (+embed), mlrCPO, sl3 and their possible integration with those frameworks. We will present some modeling flows as examples and also introduce a tidy approach to modeling. |
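A small taste of the "framework" idea with caret, where one verb wraps resampling, tuning and fitting for many model types (the random forest choice is illustrative and needs the randomForest package installed):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fit <- train(Species ~ ., data = iris,
             method     = "rf",                   # swap for "glmnet", "xgbTree", ...
             trControl  = ctrl,
             tuneLength = 3)                      # small hyperparameter grid search
fit
```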
11:48 |
|
Data mining |
Erin LeDell
|
Building and Benchmarking Automatic Machine Learning Systems |
Sigrid Keydana |
Cassiopée |
|
This talk will provide a brief overview of the field of Automatic Machine Learning (AutoML), with a focus on software and benchmarking. The term "AutoML" refers to automated methods for model selection and/or hyperparameter optimization and includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering. AutoML tools are designed to maximize ease-of-use by simplifying the API. We will discuss the common AutoML software design patterns and take a detailed look at the AutoML algorithm inside of the "h2o" R package.
An important part of the development process to evolve and improve an AutoML system is a comprehensive benchmarking strategy. Benchmarking AutoML software across a wide variety of datasets allows the algorithm designer to identify weaknesses in the algorithm and software. This enables tool designers to make incremental, measurable improvements to their system over time. We will present a new open source platform for benchmarking AutoML systems which is part of the OpenML.org ecosystem for reproducible research in machine learning. The system is extensible, so anyone can write a wrapper for their software in order to benchmark it against the most popular open source AutoML systems. We will also present benchmarking results for H2O AutoML against a variety of (mostly Python-based) AutoML systems. |
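A minimal sketch of the H2O AutoML interface discussed in the talk (the data set and time budget are illustrative):

```r
library(h2o)
h2o.init()

train <- as.h2o(iris)

aml <- h2o.automl(y = "Species", training_frame = train,
                  max_runtime_secs = 60)   # or cap the run with max_models

aml@leaderboard   # all candidate models, ranked by cross-validated performance
aml@leader        # the best model, usable with h2o.predict()
```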
12:06 |
|
Data mining |
Michel Lang
|
mlr3: A new modular framework for machine learning with R |
Sigrid Keydana |
Cassiopée |
|
The package mlr (Machine Learning with R) was released to CRAN in 2013, and its core design dates back even further. The new mlr3 package is its modular, from-scratch reimplementation in R6. Data is stored primarily as data.tables. mlr3 relies heavily on the reference semantics of R6 and data.table, which enable efficient and elegant programming on the provided machine learning building blocks. The package is geared towards scalability and larger datasets by natively supporting parallelization and out-of-memory data-backends like databases. With a clear object-oriented design, mlr3 focuses on core computational operations, while add-on packages allow seamless integration of extended functionality. For example, mlr3survival implements tasks and learners for survival analysis, and mlr3pipelines extends mlr3 with graphs of (pre)-processing operations, which can be jointly tuned with the mlr3tuning package. Project page: https://mlr3.mlr-org.com |
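A minimal sketch of the mlr3 building blocks, using the sugar functions from the package documentation (task and learner choices are illustrative):

```r
library(mlr3)

task    <- tsk("iris")                       # a predefined classification task
learner <- lrn("classif.rpart", cp = 0.01)   # a learner with one hyperparameter set

rr <- resample(task, learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.acc"))             # aggregated accuracy across the folds
```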
12:24 |
|
Data mining |
Bernd Bischl
|
mlr3pipelines: Machine Learning Pipelines as Graphs |
Sigrid Keydana |
Cassiopée |
|
mlr3pipelines is an object-oriented dataflow programming toolkit for machine learning in R6. It provides an expressive and intuitive language to define ML workflows as directed acyclic graphs that represent data flows between computational units, e.g., preprocessing, model fitting and model combination. This chains data and model manipulation steps in a modular way to form powerful data processing pipelines. Many complex ML concepts, for which special purpose packages are usually provided, can now be expressed in few lines of graph definition code: e.g., unions of feature views, bagging, stacking and hurdle models. Resulting pipelines are parameterized, so all components can jointly be tuned to obtain an optimal configuration. Graphs can contain "branching" nodes which allow selective, conditional processing of execution paths. The tuning of such tasks allows complex model selection. The modular, object-oriented concept of mlr3pipelines facilitates convenient extension with custom operations, while the compatibility with mlr3 allows convenient tuning, benchmarking, nested resampling and more. Project page: https://github.com/mlr-org/mlr3pipelines |
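A minimal sketch of such a graph, chaining preprocessing operators to a learner with the %>>% operator and then treating the whole pipeline as a single learner (operator names follow the mlr3pipelines documentation):

```r
library(mlr3)
library(mlr3pipelines)

graph <- po("scale") %>>%
  po("encode") %>>%
  po("learner", lrn("classif.rpart"))

glrn <- GraphLearner$new(graph)              # the pipeline now behaves like one learner
resample(tsk("iris"), glrn, rsmp("holdout"))
```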
11:30 |
|
Programming 1 |
Scott Chamberlain
|
HTTP Requests For R Users and Package Developers |
Ildiko Czeller |
Saint-Exupéry |
|
Many R users request data from the web in their scripts and packages. This talk introduces a modern suite of packages for managing HTTP requests. The crul package is a modern HTTP request library, including asynchronous requests, automatic handling of pagination, and more. Importantly, crul provides an R6-based object system that makes it easier to program with relative to other tools. The webmockr package mocks HTTP requests, returning user-specified mocked responses matching the format of the real thing. The vcr package leverages webmockr to cache HTTP requests and responses. Both webmockr and vcr support many HTTP libraries. Last, httpcode provides information on all HTTP codes, and fauxpas provides proper HTTP error classes for use in most HTTP R libraries. Together these tools provide a modern way for R programmers to manage HTTP requests. |
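A minimal sketch of the R6 interface in crul, plus stubbing the same request in tests with webmockr (the URL is illustrative):

```r
library(crul)

con <- HttpClient$new(url = "https://httpbin.org")
res <- con$get("get", query = list(foo = "bar"))
res$status_code
res$parse("UTF-8")

# in unit tests, webmockr can intercept the same call:
library(webmockr)
webmockr::enable()
stub_request("get", "https://httpbin.org/get?foo=bar")
```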
11:48 |
|
Programming 1 |
Colin Gillespie
|
R and security |
Ildiko Czeller |
Saint-Exupéry |
|
Data science using R is increasingly performed in the cloud or over a network. But how secure is this process? In this talk, we won't look at complex hacking but instead focus on the relatively easy hacks that can be performed to access systems. We'll use three R-related examples of how it is possible to access a user's system. In the first example, we'll investigate domain squatting on the Bioconductor website. By registering only thirteen domains, we had the potential to run arbitrary code on hundreds of users' systems. In the second example, we'll look at techniques for guessing passwords on RStudio Server instances. Lastly, we'll highlight how users can be a little too trusting when running R code from blogs. |
12:06 |
|
Programming 1 |
Jennifer Bryan
|
DRY out your workflow with the usethis package |
Ildiko Czeller |
Saint-Exupéry |
|
Usethis is one of the packages created in the recent "conscious uncoupling" of the devtools package. Devtools is an established package that facilitates various aspects of package development. Never fear: devtools is alive and well and remains the public face of this functionality, but it has recently been split into a handful of more focused packages, under the hood. Usethis now holds functionality related to package and project setup. I'll explain the "conscious uncoupling" of devtools and describe the current features of usethis specifically. The DRY concept -- don't repeat yourself -- is well accepted as a best practice for code and it's an equally effective way to approach your development workflow. The usethis package offers functions that enact key steps of the package development process in a programmatic and documented way. This is an attractive alternative to doing everything by hand or, more realistically, copying and modifying files from one of your other packages. Usethis helps with initial setup and also with the sequential addition of features, such as specific dependencies (e.g. Rcpp, the pipe, the tidy eval toolkit) or practices (e.g. version control, testing, continuous integration). |
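A few of the setup steps mentioned above, done programmatically (the package name and license holder are hypothetical):

```r
library(usethis)

create_package("~/rpkgs/mypkg")   # scaffold DESCRIPTION, R/, .Rbuildignore, ...
use_git()                         # initialise version control
use_mit_license("Jane Doe")       # add a license
use_testthat()                    # testing infrastructure
use_rcpp()                        # add the Rcpp dependency and src/ scaffolding
use_pipe()                        # re-export the magrittr pipe
```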
12:24 |
|
Programming 1 |
Lionel Henry
|
Reusing tidyverse code, the easy way |
Ildiko Czeller |
Saint-Exupéry |
|
In 2017 the tidyverse grammars were reimplemented on top of tidy evaluation, a metaprogramming framework from the rlang package. Tidy eval makes it possible to program flexibly and robustly with data masking functions from packages like dplyr or ggplot2. However, possible does not mean easy. Tidy eval has the major downside of requiring to learn new programming concepts and tools. In this talk, we'll focus on easier techniques to reuse code from tidyverse pipelines without such a steep learning curve: mapping columns, using fixed column names, passing dots, subsetting the `.data` pronoun, and interpolating expressions. These techniques will help you write functions around tidyverse pipelines and reduce code duplication in your scripts. |
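Two of the listed techniques, sketched with dplyr (data and column names are illustrative): subsetting the `.data` pronoun with a character column name, and passing the dots straight through to `group_by()`:

```r
library(dplyr)

mean_by <- function(df, measure, ...) {
  df %>%
    group_by(...) %>%                                       # pass the dots through
    summarise(mean = mean(.data[[measure]], na.rm = TRUE))  # character column name
}

mean_by(mtcars, "mpg", cyl, gear)
```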
12:42 |
|
Programming 1 |
Davis Vaughan
|
Simple Arrays |
Dirk Eddelbuettel |
Saint-Exupéry |
|
Within the tidyverse, the core structure that powers many packages is the tibble, a modern reimagining of the data frame. Unfortunately, with the large focus on data frames, the array has been left behind. The rray package is an attempt to change that. By borrowing ideas from tibble, rray hopes to create "simpler arrays" that are more predictable to use and program around. To accomplish this, rray provides the following new infrastructure: (i) an rray class, which never drops dimensions while subsetting and consistently retains dimension names where possible; (ii) broadcasting semantics, using the xtensor library - rray implements the wildly popular idea of broadcasting, originally found in the Python library numpy, to allow more intuitive and powerful operations between multiple rray objects, opening up a much more complete set of operations than is currently possible with base R; (iii) a consistent toolkit for common array manipulation tasks, such as computing sums and products along any axis. Each function retains dimensionality by default, making it easy to link operations together through broadcasting. Importantly, this toolkit works with base R arrays as well as with the new rray objects. https://davisvaughan.github.io/rray/ https://github.com/DavisVaughan/rray |
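A broadcasting sketch in the spirit of the package README; the function names `rray()` and `rray_sum()` are assumptions, so check the linked documentation for the current interface:

```r
library(rray)

x <- rray(1:6, dim = c(3, 2))
y <- rray(c(10, 20, 30), dim = c(3, 1))

x + y                   # the (3, 1) object is broadcast across both columns of x

rray_sum(x, axes = 1)   # column sums; dimensions are retained (a 1 x 2 result)
```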
11:30 |
|
Models 2 |
Ghislain Vieilledent, Jeanne Clément
|
Using Rcpp* packages for easy and fast Gibbs sampling MCMC from within R |
Marco Scutari |
Caravelle 2 |
|
Hierarchical Bayesian models are increasingly used in applied statistics. The parameters of such models can be estimated through Gibbs sampling Markov chain Monte Carlo using a variety of algorithms (conjugate priors, Metropolis-Hastings, Hamiltonian Monte Carlo). These algorithms approximate the parameter posterior distributions through iterative simulations and are computationally intensive. Using the C/C++ language to code such algorithms makes computations faster. In our presentation, we will show how the Rcpp* R packages (Rcpp, RcppGSL and RcppArmadillo) can be easily used to (i) call Gibbs samplers written in C++ from within R, (ii) reduce computation time through efficient random draws, and (iii) facilitate vector and matrix operations. We will illustrate this approach with the new jSDM R package for fitting joint species distribution models. |
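As a generic illustration of the approach (not the jSDM sampler itself), a small Gibbs sampler for a bivariate normal can be written in C++ and called from R with Rcpp::cppFunction():

```r
library(Rcpp)

cppFunction('
NumericMatrix gibbs_cpp(int n_iter, double rho) {
  NumericMatrix out(n_iter, 2);
  double x = 0.0, y = 0.0;
  for (int i = 0; i < n_iter; i++) {
    // draw each coordinate from its full conditional given the other
    x = R::rnorm(rho * y, sqrt(1 - rho * rho));
    y = R::rnorm(rho * x, sqrt(1 - rho * rho));
    out(i, 0) = x;
    out(i, 1) = y;
  }
  return out;
}')

draws <- gibbs_cpp(10000, rho = 0.8)
cor(draws)   # the sample correlation should be close to 0.8
```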
11:48 |
|
Models 2 |
Elias Krainski
|
A toolbox for fitting non-separable space-time log-Gaussian Cox models using R-INLA |
Marco Scutari |
Caravelle 2 |
|
Many processes have non-separable space-time dynamics (e.g. disease spread and species distribution) which should be accounted for during modeling. A non-separable stochastic partial differential equation (SPDE) approach can be used to consider the realistic space-time evolution of the process, in which the spatial and temporal autocorrelation in the latent field are linked (Krainski 2018). Observations of these processes are often measured as point-referenced locations in time, i.e. space-time point patterns. The log-Gaussian Cox process model is a popular class for modeling point patterns. However, it can be difficult to fit in practice because the likelihood depends on an integral over the spatial and temporal domains. Implementing these models in R-INLA is challenging because it involves several steps. We provide a step-by-step approach to constructing a space-time model in R-INLA using fox rabies as a case study. We discuss several useful updates to the R-INLA package (e.g. inlabru), including improvements to the integration methods in space and time and options to improve computational performance. We also discuss several practical considerations for users to bear in mind during model construction, including mesh generation, model implementation, model checking and extensions to the basic model. The goal of this work is to help users avoid common pitfalls when constructing and interpreting these models. |
12:06 |
|
Models 2 |
Wei Jiang
|
Adaptive Bayesian SLOPE -- High-dimensional Model Selection with Missing Values |
Marco Scutari |
Caravelle 2 |
|
Model selection with high-dimensional data has become an important issue over the last two decades. In the presence of missing data, only a few methods are available to select a model, and their performance is limited. We propose a novel approach -- Adaptive Bayesian SLOPE -- an extension of sorted $l_1$ regularization in a Bayesian framework, to perform parameter estimation and variable selection simultaneously in the high-dimensional setting. This methodology in particular aims at controlling the False Discovery Rate (FDR). Meanwhile, we tackle the problem of missing data with a stochastic approximation EM algorithm. The proposed methodology is further illustrated by comprehensive simulation studies, in terms of power, FDR and bias of estimation. |
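For reference, the sorted $l_1$ (SLOPE) penalty that the method extends is usually written as follows; this is the standard formulation from the SLOPE literature, not a formula taken from the talk itself:

```latex
\mathrm{pen}(\beta) \;=\; \sum_{j=1}^{p} \lambda_j \, |\beta|_{(j)},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,
```

where $|\beta|_{(1)} \ge \cdots \ge |\beta|_{(p)}$ are the absolute coefficients sorted in decreasing order, so the largest coefficients receive the largest penalty weights.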
12:24 |
|
Models 2 |
Raluca Gui
|
REndo: An R Package to Address Endogeneity Without External Instrumental Variables |
Marco Scutari |
Caravelle 2 |
|
Endogeneity becomes a challenge when aiming to uncover causal relationships in empirical research. The reasons are manifold, e.g. omitted variables, measurement error or simultaneity. These might lead to unwanted correlation between the independent variables and the error term of a statistical model. While external instrumental variable methods can be used to control for endogeneity, these approaches require additional information which is usually difficult to obtain.
Internal instrumental variable (IIV) methods address this issue by treating endogeneity without the need for additional variables, taking advantage of the structure of the data. Implementations of IIV are rare. Therefore, we propose the R package "REndo", which implements five instrument-free methods: the latent instrumental variables approach (Ebbes et al. 2005), the higher moments estimation (Lewbel 1997), the heteroskedastic error approach (Lewbel 2012), the joint estimation using copulas (Park and Gupta 2012) and the multilevel GMM (Kim and Frees 2007).
This talk will focus on both the theory behind each of the five methods and their practical implementation using real and simulated data. |
12:42 |
|
Models 2 |
Anne Helby Petersen
|
Discovering the cause: Tools for structure learning in R |
Marco Scutari |
Caravelle 2 |
|
Enormous amounts of observational data are being produced every day by internet users, health care providers and satellites alike. This opens up a lot of new possibilities for what observational data may be used for. But if the subject is causality, it is still common to rely solely on externally proposed, "hypothesis-driven" models, which limits the range of causal inquiries. However, in some cases it is possible to construct a causal model from the data using structure learning. This is not only relevant for those interested in inferring causality. Even when prediction is the goal, knowledge of causal structures is useful for helping domain adaptation, because the mechanistic nature of causal structures makes them more stable. A myriad of packages for structure learning have been developed in R, including pcalg, bnstruct, bnlearn, deal, catnet and stablespec, each of them dedicated to a certain class of causal models (e.g. linear), a specific algorithmic approach (e.g. constraint-based), or a combination of both. In this presentation, I provide an overview of existing R packages for structure learning, focusing on overlaps and differences in functionality, interface and possibilities to include external information. I also discuss how the packages may be integrated into a joint tool, thereby facilitating structure learning without settling on a model class or learning approach a priori. |
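A small taste of structure learning with one of the packages mentioned, bnlearn, using its score-based hill-climbing algorithm (the bundled synthetic data set is used purely for illustration):

```r
library(bnlearn)

data(learning.test)          # synthetic data shipped with bnlearn
dag <- hc(learning.test)     # learn a DAG by score-based hill climbing
dag
arcs(dag)                    # the learned directed edges
```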
11:30 |
|
Forecasting |
Mitchell O'Hara-Wild
|
Flexible futures for fable functionality |
Genaro Sucarrat |
Ariane 1+2 |
|
The fable ecosystem provides a tidy interface for time series modelling and forecasting, leveraging the data structure from the tsibble package to support a more natural analysis of modern time series. fable is designed to forecast collections of related (possibly multivariate) time series, and to provide tools for working with multiple models. It emphasises density forecasting, whilst continuing to provide a simple user interface for point forecasting. Existing implementations of time series models work well in isolation; however, it has long been known that ensembles of forecasts improve forecast accuracy. Hybrid forecasting (separately forecasting components of a time series) is another useful forecasting method. Both ensemble and hybrid forecasts can be expressed as forecast combinations. Recent enhancements to the fable framework now provide a flexible approach to easily combine and evaluate the forecasts from multiple models. The fable package is designed for extensibility, allowing for easier creation of new forecasting models and tools. Without any further implementation, extension models can leverage essential functionality including plotting, accuracy evaluation, model combination and diagnostics. This talk will feature recent developments to the fable framework for combining forecasts, and the performance gain will be evaluated using a set of related time series. |
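A minimal sketch of such a forecast combination, assuming the fable/fabletools interface where arithmetic on the model columns of a mable defines a combination model (data from tsibbledata):

    library(fable)
    library(tsibble)
    library(tsibbledata)
    library(dplyr)

    aus_production %>%
      model(ets = ETS(Beer), arima = ARIMA(Beer)) %>%
      mutate(ensemble = (ets + arima) / 2) %>%   # forecast combination of the two models
      forecast(h = "2 years") %>%
      autoplot(aus_production)                   # density forecasts for all three models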
11:48 |
|
Forecasting |
Thiyanga Talagala
|
Feature-based Time Series Forecasting |
Genaro Sucarrat |
Ariane 1+2 |
|
This work presents two feature-based forecasting algorithms for large-scale time series forecasting. The algorithms involve computing a range of features of the time series, which are then used to select the forecasting model. The forecasting model selection process is carried out using a pre-trained classifier. In our first algorithm, we use a random forest as the classifier. We call this framework FFORMS (Feature-based FORecast Model Selection). The second algorithm uses an efficient Bayesian multivariate surface regression approach to estimate the forecast error of each method, and then selects the forecasting model with the minimum predicted error. Both algorithms have been evaluated using time series from the M4 competition, and are shown to yield accurate forecasts comparable to several benchmarks and other commonly used automated approaches in the time series forecasting literature. The methods are made available in the seer and fformpp packages in R. |
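The seer and fformpp interfaces are not reproduced here; purely as a generic illustration of the feature-based idea (with made-up model labels), one could compute features with the tsfeatures package and feed them to an off-the-shelf classifier:

    library(tsfeatures)
    library(randomForest)

    series <- list(AirPassengers, USAccDeaths, UKgas, ldeaths)
    feats  <- as.data.frame(tsfeatures(series))   # one row of features per series

    # hypothetical labels: which model forecast each training series best
    best <- factor(c("ets", "arima", "ets"))

    rf <- randomForest(x = feats[1:3, ], y = best)
    predict(rf, feats[4, , drop = FALSE])         # select a model for a new series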
12:06 |
|
Forecasting |
Benjamin Goehry Hui Yan
|
Random forests for time series |
Genaro Sucarrat |
Ariane 1+2 |
|
Random forests were introduced in 2001 by Breiman and have since become a popular learning algorithm, for both regression and classification. However, when dealing with time series, random forests do not take the time-dependent structure into account and treat each instant as an independent observation. In this study, we propose rangerts, an extended version of the ranger package for time series.
Goehry (2018) proved, under suitable hypotheses on the parameters and the time series, that random forests are consistent. In practice, the idea is to replace the IID bootstrap with a dependent bootstrap to subsample the time series during the tree construction phase, thereby taking time dependency into account.
We tested our package on both numerical simulations and real-world applications to electricity load, and show that our method improves the forests' accuracy in some cases. We also discuss how to choose the key parameters. References: Breiman L. Random forests, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. Goehry B. Random forests for time-dependent processes, preprint, available: https://hal.archives-ouvertes.fr/hal-01955331, 2018. Wright M.N. & Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R, J Stat Softw 77:1-17, 2017. |
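A baseline fit with the CRAN ranger package on lagged data is sketched below (AirPassengers stands in for an electricity load series); rangerts, on GitHub, extends this interface with dependent/block bootstrap options whose argument names are not reproduced here:

    library(ranger)

    y <- as.numeric(AirPassengers)
    dat <- data.frame(
      load  = y[13:length(y)],
      lag1  = y[12:(length(y) - 1)],           # lagged values as predictors
      lag12 = y[1:(length(y) - 12)],
      month = rep(1:12, length.out = length(y) - 12)
    )

    fit <- ranger(load ~ ., data = dat, num.trees = 500)
    fit$prediction.error                       # out-of-bag mean squared error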
12:24 |
|
Forecasting |
Ivan Svetunkov
|
Smooth forecasting in R |
Genaro Sucarrat |
Ariane 1+2 |
|
There are several packages in R that implement forecasting using state space models, but only one that relies on the single source of error state space model (the "forecast" package). Unfortunately, the forecasting functions in that package are not flexible enough for different research purposes. For example, exponential smoothing, implemented in the ets() function, does not allow using explanatory variables, does not allow setting the initial values of the state vector, and does not allow fitting models to data with periodicity higher than 24. This motivated the original development of the smooth package back in 2015, with the main aim of making research in the forecasting area "smooth". Four years later, the smooth package has a handful of flexible forecasting functions useful for different research purposes, implementing ETS, ARIMA, vector exponential smoothing, simulation functions and more. In this presentation we will discuss the main functions of the package, show their advantages and disadvantages, and show how they can be applied to real-world forecasting problems, complementing "forecast" and other widely used packages. |
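A minimal sketch of the central function, es(), with automatic ETS selection (see the package documentation for ARIMA via ssarima() and the other functions mentioned):

    library(smooth)

    # automatic ETS selection ("ZZZ"), 12-step-ahead forecast, with a holdout for accuracy
    fit <- es(AirPassengers, model = "ZZZ", h = 12, holdout = TRUE)
    fit             # model summary, information criteria and accuracy on the holdout
    fit$forecast    # point forecasts for the next 12 observations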
12:42 |
|
Forecasting |
Eran Raviv
|
Forecast Combination in R |
Genaro Sucarrat |
Ariane 1+2 |
|
Introducing the R package ForecastComb. The aim is to provide researchers and practitioners with a comprehensive implementation of the most common ways in which forecasts can be combined. The package in its current version covers 15 popular estimation methods for creating combined forecasts, including simple methods, regression-based methods, and eigenvector-based methods. It also includes useful tools to deal with common challenges of forecast combination (e.g., missing values in component forecasts, or multicollinearity), and to rationalize and visualize the combination results. |
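A minimal sketch, assuming the package's documented foreccomb() workflow (set up the combination task from an observed vector and a matrix of component forecasts, then apply a combination method); the simulated forecasts are purely illustrative:

    library(ForecastComb)

    set.seed(42)
    obs   <- rnorm(100)
    preds <- cbind(m1 = obs + rnorm(100, sd = 0.5),
                   m2 = obs + rnorm(100, sd = 1.0),
                   m3 = obs + rnorm(100, sd = 1.5))

    fc_task <- foreccomb(observed_vector = obs, prediction_matrix = preds)
    comb_SA(fc_task)    # simple average combination
    comb_OLS(fc_task)   # regression-based combination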
11:30 |
|
Communities & conferences |
Dennis Irorere
|
R for Data Science Online Community |
Hannah Frick |
Concorde 1+2 |
|
The R for Data Science (R4DS) Online Learning Community was started by Jesse Mostipak in August 2017, with the goal of creating a supportive and responsive online space for learners and mentors to come together to provide support and guidance in learning R in a beginner-friendly, safe and supportive environment. The main aim of the community was to move through the *R for Data Science* text by Garrett Grolemund and Hadley Wickham, which walks readers through the major features of the tidyverse in the context of data science. We also aim to help users learn R and expand their R knowledge. Over time, the R4DS community has been developing projects intended to help connect mentors and learners. One of the first projects born out of this collaboration is #TidyTuesday, a weekly social data project focused on using tidyverse packages to clean, wrangle, tidy, and plot a new dataset every Tuesday. Over the years, the community has experienced steep growth, with over 2,000 members from various countries in our Slack channel. At the R4DS community, we encourage diversity and contribution from everyone. We believe that no question is silly and no one is an island of knowledge. In this talk, we will share the positive changes made since the transfer of leadership, our challenges, the lessons we have learnt, and provide room for suggestions and opinions. |
11:48 |
|
Communities & conferences |
Laura Acion
|
Insights from the recent R community development and growth in Latin America |
Hannah Frick |
Concorde 1+2 |
|
This talk summarizes some of the recent work that put Latin America (LatAm) on the international R user map. In January 2017, R-Ladies Buenos Aires (BA) was founded. R-Ladies BA encouraged the foundation of other R-Ladies groups in the region (e.g., Santa Rosa, Santiago). In two years, LatAm went from one to 29 active R-Ladies groups. English-only materials can be a barrier to R. Hence, LatAm R-Ladies led efforts to translate into Spanish: R for Data Science (es.r4ds.hadley.nz), the R-Ladies code of conduct and policies (bit.ly/2UwP79x), and the R Consortium 2017 (bit.ly/2Tpt8RR) and RStudio 2018 surveys. The creation of new regional conferences was another R-Ladies-led effort. Thanks to the R-Ladies network, LatinR (latin-r.com) got started rapidly and LatAm had its first SatRDay. Among other achievements, these events boosted the regional community in the form of new and diverse R user groups (e.g., Rosario, Montevideo). In a nutshell, in the last two years, the LatAm R user community has seen fast growth and regrouping, mainly spearheaded by R-Ladies. This presentation details key aspects of this process and seeks to inspire other regional R communities worldwide. |
12:06 |
|
Communities & conferences |
Dennis Irorere Shel Kariuki
|
AfricaR |
Hannah Frick |
Concorde 1+2 |
|
AfricaR is a consortium of passionate African R user groups and users who innovate with R every day and are inspired to share their experience as well as communicate their findings to a global audience. The consortium was born out of the underrepresentation of Africans in every role and area of participation, whether as R developers, conference speakers, educators, users, researchers, leaders or package maintainers. As a community, our mission is to achieve improved representation by encouraging, inspiring, and empowering Africans of all genders who are underrepresented in the global R community. Our primary objective is to support existing R users and R enthusiasts across Africa in embracing the full potential of R programming, by fostering a collaborative continental network of R gurus, mentors, learners, developers and leaders to help facilitate individual and collective progress worldwide. The AfricaR talk includes a presentation of our work plan, collaborators, partners and mentors. We will also use this opportunity to present statistics on our members and the R user groups in our network, and to launch our website. #AfricaRusers - Twitter handle. |
12:24 |
|
Communities & conferences |
Noa Tamir Colin Gillespie Riva Quiroga Vincent Warmerdam
|
The truth about satRdays (panel session, part 1) |
Hannah Frick |
Concorde 1+2 |
|
Nine satRday events have been organised since useR! 2018. Join us for a panel discussion with a diverse group of organisers from across the world. You will learn about why they chose to volunteer their time, how they approached satRday's mission, what they have learnt from organising the event, how satRday impacted their local R community, and more! |
12:42 |
|
Communities & conferences |
Noa Tamir Colin Gillespie Riva Quiroga Vincent Warmerdam
|
The truth about satRdays (panel session, part 2) |
Hannah Frick |
Concorde 1+2 |
|
Nine satRday events have been organised since useR! 2018. Join us for a panel discussion with a diverse group of organisers from across the world. You will learn about why they chose to volunteer their time, how they approached satRday's mission, what they have learnt from organising the event, how satRday impacted their local R community, and more! |
11:30 |
|
Biostatistics & epidemiology 1 |
Thibaut Jombart
|
Reproducible data science to support outbreak responses: experience from the North Kivu Ebola outbreak |
Rich Fitzjohn |
Guillaumet 1+2 |
|
The response to emerging disease outbreaks and health emergencies is increasingly data-driven, integrating various sources of information to improve situational awareness in real time. Outbreak analytics face many of the modern data science challenges, plus additional difficulties pertaining to the emergency, low-resource settings characterising some of these outbreaks. In this presentation, I will outline some of these challenges, and a range of solutions developed by the R Epidemics Consortium (RECON), an NGO dedicated to developing free analytics resources for health crises. In particular, we will discuss different aspects relating to the deployment of robust, reliable reporting infrastructures in the 2019 Ebola outbreak in North Kivu, DRC. We will showcase features of new R packages dedicated to data standardisation and cleaning (linelist), automated reporting (reportfactory), and offline alternatives to online repositories such as CRAN and GitHub for deploying R-based analysis environments (RECON deployer). I will conclude on how R can help strengthen outbreak response capacities, and more generally humanitarian work, in low- and middle-income countries. |
11:48 |
|
Biostatistics & epidemiology 1 |
Zhian Kamvar
|
Advancing data analytics for field epidemiologists using R: the R4epis innovation project |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Data analysis is integral to informing operational elements of humanitarian medical responses. Field epidemiologists play a central role in informing such responses as they aim to rapidly collect, analyse and disseminate results to support Médecins Sans Frontières (MSF) and partners with timely and targeted intervention strategies. However, a lack of standardised analytical methods within MSF challenges this process. The R4epis project group consists of 18 professionals with expertise in R programming, field epidemiology, data science, health information systems, geographic information systems, and public health. Between October 2018 and April 2019, R scripts were developed to address all aspects of data cleaning, data analysis, and automatic reporting for outbreaks (measles, meningitis, cholera and acute jaundice) and surveys (retrospective mortality, malnutrition and vaccination coverage). Analyses and outputs were piloted and validated by epidemiologists using historical data. The resulting templates were made available to field epidemiologists for field testing, which was conducted between February and April 2019. R4epis will contribute to improving the quality, timeliness and consistency of data analyses, and to the standardisation of outputs from field epidemiologists during emergency response. |
12:06 |
|
Biostatistics & epidemiology 1 |
Vincent Audigier
|
micemd: a smart multiple imputation R package for missing multilevel data |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Multiple imputation is a common strategy to overcome the missing data issue. Several MI methods have been proposed in the literature to impute multilevel data with classical sporadically missing values only. However, methods for dealing with more complex missing data are needed. Indeed, the multilevel structure is often due to data merging (because of the heterogeneity between collected datasets), but variables often vary according to the dataset, leading to systematically missing variables. micemd is an add-on for the mice package to perform multiple imputation using chained equations with two-level data. It includes imputation methods specifically handling sporadically and systematically missing values (Resche-Rigon et al. 2013, Audigier, V. et al, 2018). micemd offers a complete solution for the analysis: the choice of the imputation model for each variable can be automatically tuned according to the data structure (Audigier, V. et al, 2018), it gathers tools for checking model fit (Blackwell, M. et al, 2015) and it allows parallel calculation. The talk is motivated by a meta-analysis in cardiovascular disease consisting of 28 observational cohorts in which both systematically and sporadically missing data occur (the GREAT data). Then, based on a simulation study, the advantages and drawbacks of each multiple imputation method are discussed. Finally, the methods are compared on the GREAT data. |
12:24 |
|
Biostatistics & epidemiology 1 |
Iryna Schlackow
|
Facilitating external use with user-friendly interfaces: a health policy model case study |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Health policy models are increasingly being used by clinicians, analysts and policy makers to predict patients' long-term outcomes, how these outcomes are affected by interventions, and whether the interventions are (cost-)effective and should be recommended for use. The usability of models, as well as the reliability and transparency of methods and results, is therefore vital. Even when the code is freely available, laws preventing the sharing of sensitive individual patient data mean that the published results are still not fully reproducible. Additionally, model users may want to change input parameters, and therefore need to possess the skills, and time, to understand the underlying code. We present a Shiny-based user-friendly web interface for a health policy model predicting the progression of chronic kidney disease and cardiovascular complications. The interface is freely available and users can change different parameters using drop-down menus or .csv files; the output is a detailed downloadable .csv file, and a user guide is provided together with a range of templates. We discuss how, in addition to aiding usability, such an interface may help with debugging and transparency, and what the key considerations during development are. |
12:42 |
|
Biostatistics & epidemiology 1 |
Torben Tvedebrink
|
genogeographer - a tool for ancestry informative markers |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Ancestry informative markers (AIMs) are genetic markers that give information about the genogeographic ancestry of individuals. They are for example used to predict the genogeographic origin of individuals related to forensic crime and identification cases. A likelihood ratio test (LRT) approach is derived in order to prevent erroneous conclusions caused by, e.g., the absence of the relevant population from the database of reference populations. The LRT is a measure of absolute concordance between a profile and a population, rather than a relative measure of the profile's likelihood in two populations by a likelihood ratio (LR). The LRT is similar to Fisher's exact test and constitutes an outlier test analogous to studentized residuals. Varying sample sizes of the reference populations in the database are explicitly included in the LRT with fewer assumptions than the LR. The LRT is extended to handle admixed profiles (parents of different genogeographic origin). The methodology has been implemented in the R package genogeographer with an optional shiny front-end that enables forensic geneticists to perform explorative analyses and produce various graphical outputs together with evidential weight computations. |
14:00 |
|
Operations & data products |
Francois Michonneau
|
How a non-profit uses R for its daily operations |
Colin Gillespie |
Concorde 1+2 |
|
The Carpentries is a non-profit organization that organizes a global community of scientists and runs workshops where researchers are taught foundational computational and data science skills. Since 2012, The Carpentries has taught 2,000+ workshops in 46 countries, reaching 38,000+ learners. Here, I will present how The Carpentries uses R in its daily operations: from analyzing our survey results, to sending personalized certificates of workshop attendance to learners, to creating live-updating visualizations that help our workshop instructors understand their audience before the workshops, to our lesson templates. I will go through examples of how R has become a central part of how we set up our systems to manage and automate workflows within our organization. The combination of literate programming to generate reports, web application development with Shiny, and the availability of packages to interact with the web APIs of many of the online tools and services we use has allowed us to develop custom workflows. We can iterate quickly and deploy using continuous integration services and Docker. This talk will be of interest to organizations looking to automate their operations, and will demonstrate how R can successfully be used in production. |
14:18 |
|
Operations & data products |
Daan Seynaeve
|
rjenkins and rrundeck: Coordinating Continuous Integration and Delivery with R |
Colin Gillespie |
Concorde 1+2 |
|
Continuous integration is a software development practice that advocates for members of a team to merge their work frequently in a shared repository. Each integration is verified through a process of automated building and testing. This process can be facilitated through the use of a build server; a popular choice is Jenkins, a free and open source automation server. Continuous delivery is an extension of the continuous integration principle which asks that every successfully built version of the software is put into a staging environment, from where it can easily be deployed to a production setting. The deployment process is still manual, but it can be simplified through the use of self-service operations. Rundeck is an open source management platform that allows users to define such self-service operations. We propose two new R packages: rjenkins and rrundeck. These packages interact with the Web APIs offered by Jenkins and Rundeck to easily create, trigger and monitor jobs and operations. This provides R users with an intuitive interface to these tools which can be used in an interactive or scripted way. |
14:36 |
|
Operations & data products |
Kelly Obriant
|
Advanced Git Integrations for Automating the Delivery of Reproducible Data Products in R |
Colin Gillespie |
Concorde 1+2 |
|
We know that adopting git or other code version control mechanisms is important for creating a culture of reproducibility in data science. But once you've established the basic best practices in your daily workflow, what comes next? This talk will be an introduction to continuous integration and continuous delivery (CI/CD) tools. I'll cover reasons why CI/CD tools can enhance reproducibility for R and data science, showcase practical examples like automated testing and push-based application deployment, and point to simple resources for getting started with these tools in a git and GitHub based environment. The target user base for advanced Git and GitHub integration tooling remains focused on software engineers and IT professionals. As data scientists lead the scientific community as a whole toward better, reproducible research practices, we need to be aware of the vast ecosystem of technology solutions that can benefit this mission. The specific tools I'll aim to cover in this short presentation are GitHub webhooks, Jenkins, and Travis, all framed in terms of their use with R code and data products. |
14:54 |
|
Operations & data products |
Verena Held Max Held
|
GitHub actions for R |
Colin Gillespie |
Concorde 1+2 |
|
Continuous integration and delivery (CI/CD) has evolved as a software development best practice, and it also strengthens reproducibility in (data) science. GitHub actions is a new workflow automation feature of the popular code repository host GitHub. It is a convenient service layer on top of the popular container standard Docker, and is itself partly open source, thus limiting vendor lock-in. GitHub actions may offer better CI/CD for the R community, but most importantly, it is simple to reason about if things go wrong. The ghactions project presented here offers three avenues to bring GitHub actions to the R community:
1. Developing and curating actions to run R-specific jobs on GitHub, including arbitrary R code or deploying to shinyapps.io.
2. Furnishing users with some out-of-the-box workflows for different kinds of R projects.
3. Documenting experiences and evolving best practices for how to make the most of GitHub actions for R. More information on the ghactions package and project can be found at: http://maxheld.de/ghactions/. |
14:00 |
|
Programming 2 |
Tomas Kalibera
|
Sustainable Package Development |
Dirk Eddelbuettel |
Cassiopée |
|
Writing and maintaining packages is an essential contribution to the R community. Despite a number of formal requirements on packages, most of the internal details of the language can be inspected and modified through reflective features and a rich C API, giving a lot of freedom to package developers. This probably contributed to the popularity of R, but poses a risk when not used responsibly.
R needs to adapt to the changing needs of its users and to changes in the software/hardware environments in which it is used. As of today, almost any change in the R runtime, however minute, breaks some packages. The causes are hard to find especially when the change is to undocumented R behavior or when it "wakes up" an old bug. Tests using all CRAN/BIOC packages are run routinely, requiring expensive hardware. Debugging requires skill, experience, knowledge of R internals and typically much more time than the implementation of the change that caused the bug. It is done by R Core and adds to the workload of repository maintainers.
This talk is an inspiration for package authors who want to develop packages responsibly, without unnecessarily increasing the cost of maintenance and evolution of R. It will include advice and examples based on my work on the R runtime (and debugging of packages) related to PROTECT bugs, in-place modification of immutable values, memory management at the C/C++ boundary, and parse data fixes. |
14:18 |
|
Programming 2 |
Elie Canonici Merle
|
Typing R |
Dirk Eddelbuettel |
Cassiopée |
|
For a long time now, programming languages have been divided into two categories: dynamically typed ones and statically typed ones. Both sides tend to argue that their system has more inherent benefits than drawbacks according to their needs. On the one hand, it is convenient to have a system that allows writing programs that are not well typed but that you know to be correct. On the other hand, no programmer is ever safe from a silly mistake taking ages to be fixed, if ever detected, in their code base. Thus, with the ever growing usage of dynamically typed languages such as R or JavaScript, it has become increasingly important to detect mistakes as early as possible in the development process. By adapting approaches inherited from the strongly statically typed languages community, we have developed a typing system for a fragment of the R programming language. We argue that it does not restrict the expressiveness of the R language beyond what is actually widely used. Moreover, we have embedded a type checker in a state-of-the-art integrated development environment, leveraging the graphical interface to report useful errors to the user. |
14:36 |
|
Programming 2 |
Perry De Valpine
|
nCompiler: C++ code-generation from R code |
Dirk Eddelbuettel |
Cassiopée |
|
Many package developers boost performance by coding key steps in C++, using R's C headers and/or Rcpp. The nimble package, predecessor to nCompiler, includes a system for automatic generation of C++ for a core subset of R's math and distribution functions. nimble implements vectorized and recycling-rule operations by code-generating to the Eigen C++ library and automatic differentiation via the CppAD library (in development versions). It includes basic flow control and static data types. However, as a general R programming tool, nimble has design limitations. nCompiler is a new package, designed to be a more general programming tool, with some refactored components from nimble. nCompiler allows definition of classes that mix R and C++ (code-generated or embedded) data and methods as well as pure functions. Much numerical work becomes C++ without coding any C++ by hand. nCompiler plays well with Rcpp and harnesses its compilation tools; harnesses Eigen more deeply, including its Tensor features; supports automatic differentiation via CppAD; and is designed for embedding code in packages, parallelizing, serializing C++ objects, and providing a natural workflow. |
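nCompiler's own interface is still evolving; as a sketch of the same R-to-C++ code-generation idea using the predecessor nimble package's nimbleFunction()/compileNimble():

    library(nimble)

    # a pure R function with declared types, compiled to C++ by nimble
    sumsq <- nimbleFunction(
      run = function(x = double(1)) {
        returnType(double(0))
        return(sum(x^2))
      }
    )

    csumsq <- compileNimble(sumsq)   # generates, compiles and loads the C++ code
    csumsq(rnorm(5))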
14:54 |
|
Programming 2 |
Zbynek Slajchrt
|
Mixed interactive debugging of R and native code with FastR and Visual Studio Code |
Dirk Eddelbuettel |
Cassiopée |
|
Interactive debuggers are one of the most useful tools to aid software development. The two most widely used approaches for interactive debugging in the R ecosystem, the built-in debugger and RStudio, do not support interactive debugging of both the R and C code of R packages at the same time and in one tool. FastR is an open source alternative R implementation that, apart from being compatible with GNU R, puts emphasis on the performance of R execution and on tooling support. FastR is part of the multilingual virtual machine (GraalVM) that, among other things, provides language-agnostic support for interactive debugging. One of the other projects built on top of GraalVM is a C/C++ interpreter. FastR can be configured to run the native code of selected R packages using this C/C++ interpreter, which should yield the same behavior, but since both languages are now running in one system, it opens up many exciting possibilities, including seamless cross-language debugging. In the talk, we will demonstrate how to configure Visual Studio Code and FastR for cross-language interactive debugging and how to debug a sample R package with native code. |
14:00 |
|
Numerical methods |
Alessandro Gasparini
|
Analysing results from Monte Carlo simulation studies using the rsimsum package and the INTEREST shiny app |
Thomas Petzoldt |
Caravelle 2 |
|
Monte Carlo simulation studies are computer experiments that involve generating data by pseudo-random sampling; they provide an invaluable tool for statistical and biostatistical research. Consequently, dissemination of results plays a focal role in driving adoption and further development of new methods. However, simulation studies are often poorly designed, analysed, and reported. One of the aspects that is often poorly reported - often not reported at all - is the Monte Carlo error of summary statistics, defined as the standard deviation of the estimated quantity over replications. Monte Carlo errors play a crucial role in understanding how results are affected by chance.
In order to aid researchers interested in running and analysing simulation studies, we developed rsimsum and INTEREST. rsimsum is an R package for analysing results from simulation studies: it computes the most common summary statistics, and Monte Carlo errors are reported by default. INTEREST is a shiny app providing an interface to rsimsum, offering tools to analyse simulation studies interactively and to export plots and tables of summary statistics for later use. rsimsum and INTEREST can aid in investigating results from simulation studies and supplement the reporting of their results to a great extent, allowing researchers to share the full results of their simulation study and readers to explore them freely and in a more engaging way. |
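A minimal sketch using the MIsim example data bundled with rsimsum, as in the package documentation:

    library(rsimsum)

    data("MIsim", package = "rsimsum")
    s <- simsum(data = MIsim, estvarname = "b", true = 0.50,
                se = "se", methodvar = "method")
    summary(s)   # bias, coverage, etc., with Monte Carlo errors reported by default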
14:18 |
|
Numerical methods |
Robert Crouchley
|
Algorithmic Differentiation in R using the RcppEigenAD package |
Thomas Petzoldt |
Caravelle 2 |
|
Algorithmic Differentiation (AD) has been available as a programming tool for statisticians for a number of decades. However, its adoption as an alternative to symbolic and numeric methods does not seem to be very common. One possible reason for this is the difficulty that is typically encountered when attempting to integrate AD functionality into existing statistical computing environments. The RcppEigenAD package attempts to mitigate these difficulties when employing AD within R by combining the facilities of the Rcpp package for extending R with C++, the AD library CppAD, and the Eigen linear algebra library, into a single R package. The resulting package, RcppEigenAD, allows the user to define matrix-valued functions of matrix arguments in C++ and seamlessly integrate them into an R session in a way that allows computing not only the function itself but also its Jacobian and Hessian. The package also includes an implementation of Faa di Bruno's formula for calculating the partial derivatives of the composition of functions defined by the user. The package has applications in the areas of optimisation, sensitivity analysis and calculating covariances via the delta method, which are illustrated with examples. |
14:36 |
|
Numerical methods |
Richard Fitzjohn
|
Describing and solving differential equations with a new domain specific language, odin |
Thomas Petzoldt |
Caravelle 2 |
|
Solving differential equations in R presents a challenge because one must choose between implementations that are either expressive but slow to compute, or fast but cumbersome to write. I present a new package, "odin", that removes this tradeoff by creating a domain specific language (DSL) hosted in R that compiles a subset of R to C for efficiently expressing and solving differential equations (JavaScript and R are compilation targets under current development). By treating the set of differential equations as a directed acyclic graph, the DSL is declarative rather than imperative. "odin" leverages the well established "deSolve" package to interface with a number of well understood solvers. I present applications of "odin" both in a research context, for implementing epidemiological models with tens of thousands of equations, and in teaching contexts, where we are using the introspection built into the DSL to automatically generate shiny interfaces. |
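A minimal sketch of the DSL (logistic growth, compiled to C, with parameter defaults via user()):

    library(odin)

    logistic <- odin::odin({
      deriv(N)   <- r * N * (1 - N / K)   # the differential equation
      initial(N) <- N0                    # initial condition
      r  <- user(0.5)                     # user-settable parameters with defaults
      K  <- user(100)
      N0 <- user(1)
    })

    mod <- logistic$new()
    mod$run(seq(0, 25, by = 1))           # solved via deSolve under the hood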
14:00 |
|
Spatial data & maps |
Emmanuel Blondel
|
Strengthening of R in support of spatial data infrastructures management: geometa and ows4R packages |
Thibault Laurent |
Ariane 1+2 |
|
The amount of data to manage is increasing across institutions. Metadata plays a key role in making this data findable, accessible, interoperable and re-usable, and has become a pillar through legal frameworks (e.g. INSPIRE) and the emergence of data management plans (DMPs). Data managers thus have to deal with these requirements by applying standards to manage (meta)data formats and access protocols. It is especially the case for spatial information, which is ruled by ISO/OGC standards. The use of R has been spreading worldwide as a preferred tool for data managers and scientists. In this context, some projects were initiated to support metadata handling for specific domains (e.g. EML), while the capacity to produce standardized ISO/OGC geographic metadata with R was limited. The geometa and ows4R packages aim to fill this gap by providing functions to write and read ISO/OGC metadata, and interfaces to the OGC Web Services. We explain how they work, including recent features provided with the support of the R Consortium. We then present how they contribute to several national and international information systems in different domains such as fisheries, marine monitoring, ecology and earth observation. Based on these packages, the geoflow initiative, an orchestrator for spatial data management in R, will be introduced to demonstrate how R can be used for managing spatial data infrastructures. |
14:18 |
|
Spatial data & maps |
Ege Rubak
|
Resample-smoothing of Voronoi intensity estimators |
Thibault Laurent |
Ariane 1+2 |
|
Voronoi estimators are non-parametric and adaptive estimators of the intensity of a point process. The intensity estimate at a given location is equal to the reciprocal of the size of the Voronoi/Dirichlet cell containing that location. Their major drawback is that they tend to paradoxically under-smooth the data in regions where the point density of the observed point pattern is high, and over-smooth where the point density is low. To remedy this behaviour, we propose to apply an additional smoothing operation to the Voronoi estimator, based on resampling the point pattern by independent random thinning. Through a simulation study we show that our resample-smoothing technique improves the estimation substantially. The proposed intensity estimation scheme is also applied to two datasets: locations of pine saplings (a planar point pattern) and motor vehicle traffic accidents (a linear network point pattern). Everything is implemented in R and released in the `spatstat` package available on CRAN. The oral presentation will explain the basic concepts, which are very simple, and leave out the mathematical details. Instead, the focus is on the relevant objects and classes in the R implementation and how, e.g., the Voronoi/Dirichlet tessellation is handled on a linear network such as a road map. |
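A minimal sketch with spatstat, assuming the densityVoronoi() arguments f (the retention probability of the independent thinning) and nrep (the number of resamples):

    library(spatstat)

    X <- swedishpines                                  # planar point pattern shipped with spatstat

    plain    <- densityVoronoi(X, f = 1)               # raw Voronoi intensity estimate
    smoothed <- densityVoronoi(X, f = 0.1, nrep = 200) # resample-smoothed estimate

    plot(smoothed, main = "Resample-smoothed Voronoi intensity")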
14:36 |
|
Spatial data & maps |
Timothée Giraud
|
Thematic mapping with "cartography" |
Thibault Laurent |
Ariane 1+2 |
|
The R spatial ecosystem is blooming and dealing with spatial objects and spatial computations has never been so easy. In this context, the aim of the cartography package is to create thematic maps with the visual quality of those designed with classical mapping or GIS tools. The package helps to design cartographic representations such as proportional symbols, choropleth, typology, flows or discontinuities maps. It also offers several features that improve the graphic presentation of maps, for instance map palettes, layout elements (scale, north arrow, title...), labels or legends. cartography is a mature package (first release in 2015) that has already been reviewed in both software- and cartography-focused journals (Giraud, Lambert 2016 & Giraud, Lambert 2017). It follows current good practices by using continuous integration and a test suite. A vignette, a cheat sheet and a companion website help new users to start using the package. In this presentation we will first give an overview of the package's main features. Then we will develop examples of use of the package along with other spatial-related packages; a short code sketch follows the references below.
Giraud, T., & Lambert, N. (2016). cartography: Create and Integrate Maps in your R Workflow. The Journal of Open Source Software, 1(4), 1-2.
Giraud, T., & Lambert, N. (2017, July). Reproducible cartography. In International Cartographic Conference (pp. 173-183). Springer, Cham. |
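A minimal sketch following the package's documented examples (the mtq sample dataset ships with cartography; variable and argument names are taken from those examples and should be treated as indicative):

    library(sf)
    library(cartography)

    mtq <- st_read(system.file("gpkg/mtq.gpkg", package = "cartography"))

    # choropleth map of median income with layout elements
    choroLayer(x = mtq, var = "MED", legend.title.txt = "Median income (euros)")
    layoutLayer(title = "Median income in Martinique", scale = 5, north = TRUE)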
14:54 |
|
Spatial data & maps |
Edwin De Jonge
|
Creating privacy protecting density maps: sdcSpatial |
Thibault Laurent |
Ariane 1+2 |
|
R allows for creating beautiful maps and many examples can be found. Cartography is an indispensable tool in analyzing spatial data and communicating regional patterns to your target audience. Current data sources often contain location data, making maps a natural visualisation choice. While these detailed location data are fine for analytic purposes, derived statistics carry the risk of disclosing information about individual persons. For example, if one plots the spatial distribution of income or of social welfare receipt, sparsely populated areas may be very revealing. Statistical procedures to control disclosure have been readily available (sdcTable, sdcMicro), but those focus on protecting tabulated data or microdata, not on protecting spatial data as such. The R package sdcSpatial (https://github.com/edwindj/sdcSpatial) allows for creating maps that show spatial patterns but at the same time protect the privacy of the target population. The package also contains methods for assessing the associated disclosure risk for a given data set. |
14:00 |
|
Visualisation |
Achim Zeileis
|
colorspace: A Toolbox for Manipulating and Assessing Color Palettes |
William Chase |
Saint-Exupéry |
|
The R package "colorspace" (http://colorspace.R-Forge.R-project.org/) provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (Hue-Chroma-Luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space.
Namely, general strategies for three types of palettes are provided: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
To aid selection and application of these palettes the package provides scales for use with ggplot2; shiny (and tcltk) apps for interactive exploration (see also http://hclwizard.org/); visualizations of palette properties; accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies. |
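A minimal sketch of the toolbox (palette construction, preview, and color-vision-deficiency emulation):

    library(colorspace)

    hcl_palettes(plot = TRUE)                       # overview of the pre-defined HCL palettes

    pal_q <- qualitative_hcl(4, palette = "Dark 3") # qualitative palette
    pal_s <- sequential_hcl(7, palette = "Blues 3") # sequential palette

    demoplot(pal_s, type = "heatmap")               # preview on a demo display
    deutan(pal_q)                                   # emulate deuteranopia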
14:18 |
|
Visualisation |
Ian Lyttle
|
Vegawidget: Composing and Rendering Interactive Vega(-Lite) Charts |
William Chase |
Saint-Exupéry |
|
Vega-Lite, alongside Vega, is a JavaScript implementation of an interactive grammar of graphics, developed by the Interactive Data Lab at the University of Washington. You build chart specifications using JSON; Vega(-Lite) renders your specifications as charts in your browser. The vegawidget package (on CRAN) is an htmlwidgets interface to Vega(-Lite), letting you compose specifications using R lists. The package offers functions to help build chart specifications and to render them as htmlwidgets. It also offers functions to define interactivity between Shiny and Vega(-Lite) via datasets, events, and signals (reactive variables). You can also define interactivity using JavaScript in an R Markdown document. Although Vega-Lite offers an interactive grammar of graphics, this package offers a low-level interface for composing chart specifications. As a result, vegawidget is designed to be extensible, making it easier to develop higher-level, user-friendly packages to build specific types of charts, or even to build a general ggplot2-like framework, using vegawidget as a foundation. Package website: https://vegawidget.github.io/vegawidget |
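A minimal sketch, close to the package's introductory example: a Vega-Lite specification built as an R list and rendered as an htmlwidget.

    library(vegawidget)

    spec <- as_vegaspec(list(
      `$schema` = vega_schema(),        # Vega-Lite schema
      data = list(values = mtcars),
      mark = "point",
      encoding = list(
        x = list(field = "wt",  type = "quantitative"),
        y = list(field = "mpg", type = "quantitative")
      )
    ))

    vegawidget(spec)   # render the chart in the viewer / browser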
14:36 |
|
Visualisation |
Ursula Laa
|
Visualising high-dimensional data: new developments of the tourr package using Shiny and plotly |
William Chase |
Saint-Exupéry |
|
The tour is a tool for the visualisation of multi-dimensional structures by means of dynamic projections, implemented in the R package tourr (Wickham et al., 2011). Availability in R means that we can readily extend it with features from other packages. In this talk I will show how we can use Shiny and plotly to create a graphical interface, enhancing usability for non-experts and allowing for interactive features like linked brushing in the tour display, stopping and restarting with new settings, or hover-text information on data points. In addition, I will show how index functions scoring the "interestingness" of 2D projections, available in various R packages, can be combined with the guided tour, steering projections towards more interesting views of the data. |
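A minimal sketch with the existing tourr interface (the flea data ship with the package); the Shiny/plotly front end discussed in the talk wraps this functionality:

    library(tourr)

    animate_xy(flea[, 1:6])                                    # grand tour, 2D projections
    animate_xy(flea[, 1:6], tour_path = guided_tour(holes()))  # guided by the holes index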
14:54 |
|
Visualisation |
Jim Harner
|
xstatR: an Environment for Running R and XLISP-STAT in Docker Containers |
William Chase |
Saint-Exupéry |
|
R is the lingua franca of statistical computing, but it lacks a strong API for interactive, dynamic graphics. Shiny and ggvis are excellent, but they do not support dynamic actions (brushing, geodesic rotations, etc.). XLISP-STAT and R were the dominant computing and graphics platforms in the nineties and early 2000s, but it became clear that only a single open source platform was viable and the community chose R. However, it is now common to use multiple platforms, e.g., R and Spark or R and Python. xstatR is an environment that combines R and XLISP-STAT within a single Docker container (https://github.com/jharner/xstatR). R and XLISP-STAT run as separate Linux processes in a container with a bridge between them, which allows saved R objects to be translated into Lisp objects (or vice versa). Models typically are built in R, and exploratory and diagnostic dynamic plots are created in XLISP-STAT. Since XLISP-STAT also supports windows, menus, etc., a graphical interface has been developed which obviates the need for users to learn Lisp. Prototype XLISP-STAT packages have been built for model-based interactive diagnostic plots, multivariate visualizations, and GGobi-type dynamic graphs. The xstatR container can be deployed locally or to any cloud service, e.g., AWS. We are refactoring xstatR to run XLISP-STAT as a subprocess of R, which will allow XLISP-STAT to more directly access R objects. |
14:00 |
|
Bioinformatics 2 |
Michael Lawrence
|
Interfacing R/Bioconductor with Hail, a Spark-based platform for genomics |
Sina Rueger |
Guillaumet 1+2 |
|
Hail is a Spark-based framework for genomic computing at scale. We have explored the application of deferred evaluation to the construction of an interface between R and Hail. The interface implements standard base R and Bioconductor APIs on top of Hail by constructing expressions in Hail's interface language and evaluating them using sparklyr. Users require no special knowledge of Hail or Spark. We will describe the design of the interface and demonstrate the manipulation of a Hail-backed SummarizedExperiment object, the core abstraction for genomic data in Bioconductor. |
14:18 |
|
Bioinformatics 2 |
Federico Marini
|
iSEE: interactive and reproducible exploration and visualization of genomics data |
Sina Rueger |
Guillaumet 1+2 |
|
Data exploration is crucial in the comprehension of large biological datasets generated by high-throughput assays such as sequencing, with interactivity as a key aspect in generating insightful outputs. Most existing tools for intuitive and interactive visualization are limited to specific assays or analyses, and lack support for reproducible analysis. Sparked by a Bioconductor community-driven effort, we have built a general-purpose tool, iSEE - Interactive SummarizedExperiment Explorer, designed for interactive exploration of any experimental data which can be stored in a SummarizedExperiment object, i.e. an integrative data container for storing matrices of assays and tables of associated metadata. iSEE (https://bioconductor.org/packages/iSEE/) is implemented in R and Shiny, and is compatible with many existing R/Bioconductor packages for high-throughput biological data (a short usage sketch follows the feature list below). Essential features include: - A highly customizable interface with different panel types, simultaneously viewing and linking panels to each other
- Automatic tracking of the exact R code generating all visible plots for full reproducibility
- Interactive tours to showcase datasets and findings
- Extendable analyses with custom panel types
- Seamless deployment as an online companion browser for collaborations and publications. |
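A minimal sketch: iSEE() takes any SummarizedExperiment-derived object and returns a Shiny app (here a small simulated SingleCellExperiment; in practice one would pass a real dataset):

    library(iSEE)
    library(SingleCellExperiment)

    counts <- matrix(rpois(2000, lambda = 5), nrow = 100,
                     dimnames = list(paste0("gene", 1:100), paste0("cell", 1:20)))
    sce <- SingleCellExperiment(
      assays  = list(counts = counts),
      colData = DataFrame(group = rep(c("A", "B"), each = 10))
    )

    app <- iSEE(sce)        # build the Shiny app
    # shiny::runApp(app)    # launch the interactive explorer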
14:36 |
|
Bioinformatics 2 |
Pol Castellano-Escuder
|
POMA: Shiny tool for targeted metabolomic data statistical analysis and visualization |
Sina Rueger |
Guillaumet 1+2 |
|
Similarly to other high-throughput technologies, metabolomics usually faces a data mining challenge in providing an understandable and useful output to advance biomarker discovery and precision medicine. Biological interpretation of the results is one of the difficult steps, and several bioinformatics tools have emerged to simplify and improve it. However, sometimes these tools accept only very simplistic data structures and, for example, do not even accept data with several covariates. POMA is a free, friendly and fast Shiny interface for analysing and visualizing data from a targeted metabolomics assay, hosted at https://polcastellano.shinyapps.io/POMA/. POMA allows the user to go from the raw data to the statistical analysis. The analysis is organized in three blocks: "Load Data" (where the user can upload a metabolite data file and a covariates file), "Pre-processing" (value imputation and normalization) and "Statistical analysis" (univariate and multivariate methods, limma, correlation analysis, feature selection methods, random forest, etc.). These steps include multiple types of interactive data visualization integrated in an intuitive user interface that requires no programming skills. Finally, POMA also generates different automatic statistical and exploratory reports to facilitate the analysis and interpretation of the results. |
16:00 |
|
Keynote |
Martin Morgan |
How Bioconductor advances science while contributing to the R language and community |
Christine Choirat |
Concorde 1+2 |
|
The Bioconductor project has had profound influence on the statistical analysis and comprehension of high-throughput genomic data, while contributing many innovations to the R language and community. Bioconductor started in 2002 and has grown to more than 1700 packages downloaded to ½ million unique IP addresses annually; Bioconductor has more than 30,000 citations in the scientific literature, and positively impacts many scientific careers. The desire for open, reproducible science contributes to many aspects of Bioconductor, including literate programming vignettes, multi-package workflows, teaching courses and online material, extended package checks, use of formal (S4) classes, reusable ‘infrastructure’ packages for robust and interoperable code, centralized version control and support, nightly cross-platform builds, and a distinctive release strategy that enables developer innovation while providing user stability. Contrasts between Bioconductor and R provide rich opportunities for reflection on establishing open source communities, how users translate software into science, and software development best practices. The ever-changing environment of scientific computing, especially the emergence of cloud-based computation and very large and heterogeneous public data resources, point to areas where Bioconductor, and R, will continue to innovate. |