09:15 |
|
Keynote |
Joe Cheng |
Shiny's Holy Grail: Interactivity with reproducibility |
Vincent Guyader |
Concorde 1+2 |
|
Since its introduction in 2012, Shiny has become a mainstay of the R ecosystem, providing a solid foundation for building interactive artifacts of all kinds. But that interactivity has always come at a significant cost to reproducibility, as actions performed in a Shiny app are not easily captured for later analysis and replication. In 2016, Shiny gained a "bookmarkable state" feature that makes it possible to snapshot and restore application state via URL. But this feature, though useful, doesn't completely solve the reproducibility problem, as the actual program logic is still locked behind a user interface.
In this talk, I'll discuss some of the approaches that app authors have taken to achieve these ends, along with some surprising and exciting approaches that have recently emerged. These new approaches usefully decrease the implementation effort and code duplication, and may eventually become essential tools for those who wish to combine interactivity with reproducibility. |
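For context, the "bookmarkable state" feature mentioned above is enabled roughly as follows (a minimal sketch based on Shiny's documented API; the app itself is illustrative):

```r
library(shiny)

ui <- function(request) {          # bookmarkable UIs are functions of `request`
  fluidPage(
    sliderInput("n", "Observations", 1, 100, 50),
    plotOutput("hist"),
    bookmarkButton()               # generates a URL encoding the current inputs
  )
}

server <- function(input, output, session) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server, enableBookmarking = "url")
```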
10:25 |
|
Workflow & development |
Brent Thorne
|
Transitioning between various RMarkdown packages for workflow optimization in academic research; a graduate student's perspective. |
Henrik Bengtsson |
Concorde 1+2 |
|
Reproducible documentation and the use of RMarkdown are becoming more prevalent in the academic world. Packages built on RMarkdown such as posterdown, xaringan, pagedown, and rticles allow most aspects of a typical research project to be produced; however, newcomers to RMarkdown can be put off or discouraged by the inconsistencies in how each type of document needs to be formatted in the '.Rmd' file. This talk will provide insight into the benefits and drawbacks of fully customizable RMarkdown packages, with emphasis on (1) the importance of a timely workflow for the short duration of an average graduate student program; (2) introducing students or supervisors who are new to RMarkdown; and (3) showing the importance of reproducibility at every stage of a good research project. Concepts from this presentation are intended to spark conversation and show the importance of reproducible document generation in all stages of a research project. |
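As a sketch of the kind of format switching discussed here, each of these packages ships an R Markdown template that can be scaffolded from R. The template names below are assumptions drawn from the packages' documentation and may differ between versions:

```r
library(rmarkdown)

# a conference poster (posterdown), slides (xaringan) and a journal article (rticles)
draft("poster.Rmd", template = "posterdown_html", package = "posterdown", edit = FALSE)
draft("slides.Rmd", template = "xaringan",        package = "xaringan",   edit = FALSE)
draft("paper.Rmd",  template = "jss_article",     package = "rticles",    edit = FALSE)

render("poster.Rmd")   # the same render() call builds each document type
```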
10:30 |
|
Workflow & development |
Paul Stevenson
|
An Approach to Project Workflow for Professional Biostatistical Services |
Henrik Bengtsson |
Concorde 1+2 |
|
Large research groups commonly employ a biostatistician to work across their portfolio of research projects; however, this is not feasible for many research-active clinicians and "low-profile/establishing" research groups, who often struggle to access biostatistical support. Our group offers "low-barrier" initial consultations for analysis, data management and database development on a "per-project" basis, which is attractive to the local health research community as it makes biostatistical expertise affordable and easily accessible. To facilitate our workflow, we have developed and refined a template project (skeleton) in R (using the "ProjectTemplate" package) that, along with version control systems, conforms to the principles of reproducible research. We have also developed R Markdown templates to produce documents following our Institute's style guide. This approach allows us to streamline our workflow, expeditiously initiate projects and produce professional-looking reports in multiple formats directly from the analysis package, without wasting time on the non-analytical aspects of our projects; the same approach works for both simple and large-scale projects. |
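A minimal sketch of initialising such a skeleton with the ProjectTemplate package (the project name is hypothetical; the group's own template adds further customisation on top of this):

```r
library(ProjectTemplate)

create.project("client-project-001")   # scaffolds data/, munge/, reports/, config/, ...
setwd("client-project-001")
load.project()                         # reads config, loads data and runs munging scripts
```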
10:35 |
|
Workflow & development |
Ildiko Czeller
|
ropsec: a package for easing operations security for the R user |
Henrik Bengtsson |
Concorde 1+2 |
|
Applying security best practices is essential not only for developers or sensitive data storage but also for the everyday R user installing R packages, contributing to open source, or working with APIs or remote servers. However, keeping up to date with security best practices and applying them meticulously requires significant effort and is difficult without expert knowledge. The goal of the R package ropsec (github.com/ropenscilabs/ropsec) is to bring some of these best practices closer to all R users and to enable them to add a few more layers of security to their personal workstation and shared work. In this talk I will focus on signing commits: why you should do it and how ropsec helps you do it the right way with the lowest possible risk of making a mess of your settings. I will also highlight how you can reliably test an R package whose core functionality is changing settings outside your R project. Work on ropsec started at the 2018 rOpenSci unconf (unconf18.ropensci.org) and the package has continuously improved since then. ropsec leverages gpg (on CRAN), which provides low-level functions for signing commits; the value added comes from the collection of high-level functions, the thorough documentation and the intuitive workflow that help R users solve end-to-end use cases. Its aim is to mitigate the risk of the user doing something they do not intend to do due to the complexity of the low-level operations. |
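For orientation, the kind of commit-signing setup that ropsec streamlines looks roughly like this when done by hand with the gpg package and git. The identity is hypothetical, and it is assumed that gpg_keygen() returns the new key's id; ropsec's own helpers wrap these steps with additional safety checks:

```r
library(gpg)

key_id <- gpg_keygen(name = "Jane Doe", email = "jane@example.org")  # hypothetical identity
gpg_list_keys()                                                      # verify the key exists

# point git at the key and sign every commit from now on
system2("git", c("config", "--global", "user.signingkey", key_id))
system2("git", c("config", "--global", "commit.gpgsign", "true"))
```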
10:40 |
|
Workflow & development |
Nicoletta Farabullini
|
compareWith - user-friendly diff viewing and VCS interaction |
Henrik Bengtsson |
Concorde 1+2 |
|
Version control systems provide an important environment for controlled code and software development. In the case of R and RStudio, however, the integration of version control tools is still significantly behind what is desirable. This has become a common concern for the whole community, especially with the rise of GitHub and git for open-source R package development, where issues are tackled through separate branches and contributors are more numerous and heterogeneous.
Command-line git interaction can be an additional barrier, and so individuals and organizations have sought different ways to deal with the shortcomings. In this talk, we propose a flexible, lightweight combination with Meld (http://meldmerge.org/), an open-source visual diff and merge tool, and demonstrate "compareWith" (https://github.com/miraisolutions/compareWith), an R package that allows users to interact with Meld from within RStudio.
compareWith provides user-friendly addins that enable and improve tasks that are otherwise difficult or impossible without a custom extension. Examples include: i) comparing differences prior to commit, for single active files or the whole project; ii) resolving and merging conflicts via three-way comparison; iii) comparing two distinct files with each other. Even simple tasks benefit from the improved diff viewer compared to the one built into RStudio. |
10:45 |
|
Workflow & development |
Hannah Frick
|
goodpractice - A Tool for Good Package Development |
Henrik Bengtsson |
Concorde 1+2 |
|
Building an R package is a great way of encapsulating code, documentation and data, in a single testable and easily distributable unit. Whether you work by yourself or with others, the goal is always to keep your code easily maintainable and bug-free. R CMD check offers a set of checks on the source code of a package to ensure a quality standard required for packages on CRAN. However, it does not cover other aspects of writing good quality software such as code complexity (1) and does not require testing. The goodpractice package leverages several R packages addressing these aspects and, in one place, calculates code coverage (via covr) and cyclomatic complexity (via cyclocomp), runs linters (via lintr), includes all checks from R CMD check (via rcmdcheck) and gives further advice on good practices for R packages, e.g., to include a URL for a bug tracker. The package currently contains 230 checks to help package developers write high quality packages to a common standard. It is both configurable and extensible, so you can use it with your custom set of checks.
(1) TJ McCabe (1976) A Complexity Measure. IEEE Transactions on Software Engineering (SE-2:4). |
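A minimal sketch of running these checks on a package source directory (the path is hypothetical):

```r
library(goodpractice)

g <- gp("path/to/mypackage")   # runs covr, cyclocomp, lintr, rcmdcheck and more
g                              # prints the advice, e.g. to add a URL for a bug tracker
results(g)                     # per-check results as a data frame
```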
10:50 |
|
Workflow & development |
Jakob Richter
|
rt - R Tools for the Command Line |
Henrik Bengtsson |
Concorde 1+2 |
|
rt - R tools for the command line - is an R package containing a collection of CLI programs to simplify R package development, daily routines and managing the R user library. It runs on Unix-alikes, macOS and Windows. rt packs many basic operations into handy command line instructions, effectively isolating tasks in a separate R process so that they do not stand in the way of your coding experience. Developing R packages is often a tedious matter involving many repetitive tasks, mainly testing and checking. Reducing the effort for those tasks adds up to a more efficient workflow and more available time for coding. rt allows you to test, check, build and spellcheck packages, and to upload them to win-builder and R-hub from the command line. Simple routines like installing packages from CRAN or GitHub, knitting documents, starting Shiny apps and updating the package library can be run directly from the shell. For users who maintain R installations on multiple machines, a dotfile configuration allows them to keep the package libraries synchronous. Project page: https://github.com/rdatsci/rt |
10:25 |
|
Text mining |
Dmytro Perepolkin
|
{polite} - web etiquette for R users |
Riva Quiroga |
Cassiopée |
|
Data is everywhere, but that does not mean it is freely available. What are the best practices and acceptable norms for accessing data on the web? How does one know when it is OK to scrape the content of a website, and how can it be done in such a way that it does not create problems for the data owner and/or other users? This lightning talk will introduce the {polite} package, a collection of functions for safe and responsible web scraping. The three pillars of {polite} are seeking permission, taking slowly and never asking twice. |
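The three pillars map onto a short workflow along these lines (the target URL and user agent are illustrative):

```r
library(polite)

session <- bow("https://www.example.com/articles",
               user_agent = "useR demo")   # seek permission: reads robots.txt, sets a delay
session                                    # prints the crawl delay and whether scraping is allowed

page <- scrape(session)                    # take slowly + never ask twice:
                                           # rate-limited and memoised retrieval
```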
10:30 |
|
Text mining |
Samuel Borms
|
The R Package sentometrics to Compute, Aggregate and Predict with Textual Sentiment |
Riva Quiroga |
Cassiopée |
|
We provide a hands-on introduction to optimized textual sentiment indexation using the R package sentometrics. Driven by the need to unlock the potential of textual data, sentiment analysis is increasingly used to capture its information value. The sentometrics package implements an intuitive framework to efficiently compute sentiment scores of numerous texts, to aggregate the scores into multiple time series, and to use these time series to predict other variables. The workflow of the package is illustrated with a built-in corpus of news articles from The Wall Street Journal and The Washington Post, creating a selection of interesting aggregated text-based indices and using these to forecast expected stock market volatility. |
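A rough sketch of that workflow, based on the package's documented functions; the exact argument names and aggregation options are assumptions and may differ between package versions:

```r
library(sentometrics)

data("usnews", package = "sentometrics")                  # built-in WSJ/WaPo corpus
corpus   <- sento_corpus(usnews)
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])

ctr      <- ctr_agg(howWithin = "counts", howDocs = "proportional",
                    howTime = "linear", by = "month", lag = 6)
measures <- sento_measures(corpus, lexicons, ctr)          # aggregated sentiment time series
plot(measures)
```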
10:35 |
|
Text mining |
Cécile Sauder, Jean Delmotte
|
BibliographeR : a set of tools to help your bibliographic research |
Riva Quiroga |
Cassiopée |
|
The number of scientific articles is constantly increasing, and it is sometimes impossible to read all the articles in certain areas. Among this great diversity of articles, some may be more interesting than others, and it is difficult to select which articles are essential in a field. The contemporary way to judge the scientific quality of an article is to use the impact factor or the number of citations. However, these parameters may lead to overlooking certain articles that are not widely cited but are very innovative. It is therefore essential to ask what makes an article fundamental in a field. Using the "fulltext" package in our Shiny web application, we show how the analysis of a bibliography using a network is a good way to visualize the state of the art in a field. We searched for different parameters to judge scientific quality using data science approaches. Recent research has shown that the work of small research teams can lead to scientific innovations. In this sense, the analysis of scientific articles by global techniques could play an important role in the discovery of these advances. |
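One building block of such an app, sketched with the fulltext package (the query, source and result fields are illustrative assumptions):

```r
library(fulltext)

res <- ft_search(query = "citation network analysis", from = "plos", limit = 50)
res$plos$data                          # metadata (DOIs, titles) for the matching articles

articles <- ft_get(res$plos$data$id)   # retrieve full text by DOI for network building
```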
10:40 |
|
Text mining |
Erwan Le Pennec
|
ggwordcloud: a word cloud geometry for ggplot2 |
Riva Quiroga |
Cassiopée |
|
Word clouds provide a nice mechanism to visualize a list of weighted words. R already has two dedicated packages, wordcloud and wordcloud2, which produce respectively a base plot and an HTML widget. ggwordcloud was introduced last fall to propose an alternative for the ggplot2 ecosystem. It consists mainly of a new geometry, geom_text_wordcloud, which places words on the ggplot2 canvas using an algorithm very similar to the one used in wordcloud. It provides some extensions: better word scaling, the use of masks as in wordcloud2, and compatibility with the ggplot2 facet system. The code has been inspired by ggrepel and relies on Rcpp so that the rendering is fast. In the lightning talk, we will present the package and some of its uses. For more details, see https://lepennec.github.io/ggwordcloud/ |
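A minimal example (the data are illustrative):

```r
library(ggplot2)
library(ggwordcloud)

words <- data.frame(
  word = c("R", "ggplot2", "wordcloud", "useR", "Toulouse"),
  freq = c(50, 30, 20, 15, 10)
)

ggplot(words, aes(label = word, size = freq)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal()
```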
10:45 |
|
Text mining |
Chung-Hong Chan
|
Die Nutella oder Das Nutella? Grammatical Gender Prediction of German Nouns |
Riva Quiroga |
Cassiopée |
|
One of the big challenges of learning German is determining the grammatical gender of German nouns (Genus / grammatisches Geschlecht, e.g. der Löffel, die Gabel, das Messer). To the casual learner, the rules seem pretty arbitrary. In this talk, I am going to show how I created a quite (or not quite) accurate model to predict the grammatical gender of German nouns using the R package keras. During the development process, I encountered some interesting problems concerning the German language and machine learning in general. |
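As a generic illustration (not the speaker's actual architecture), a character-level gender classifier in the keras R package could be set up along these lines; `x` and `y` are assumed to be integer-encoded, padded noun spellings and one-hot der/die/das labels:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 60, output_dim = 16) %>%  # ~60 distinct characters assumed
  layer_lstm(units = 32) %>%
  layer_dense(units = 3, activation = "softmax")        # der / die / das

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)

# model %>% fit(x, y, epochs = 10, validation_split = 0.2)
```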
10:50 |
|
Text mining |
Johannes Müller
|
Implementing a Classification and Filtering App for Multilingual Facebook Comments – A Use Case Of Data For Good with R |
Riva Quiroga |
Cassiopée |
|
What is the biggest challenge in data science? Some say it is messy data, others say company politics, but for us at CorrelAid one of the biggest challenges is its unused potential. Larger companies, universities or governmental organizations can afford professional data scientists, but what about civil society or NPOs? Currently, the resources to generate and use data science insights are highly imbalanced. Using R, we show how to leverage this unused potential by applying a cutting-edge multi-language classification approach.
In collaboration with Minor – a German organisation offering legal advisory services for marginalised groups – we present a data-for-good use case. We use NLP techniques such as multi-language word embeddings (word2vec), unsupervised classification (e1071, caret) and topic modeling (stm) to enable Minor volunteers to better allocate their time. We demonstrate the implementation of an interactive Shiny dashboard app for classifying and filtering multilingual Facebook comments, including English, Bulgarian and Arabic. The filter mechanisms first identify and filter out comments in the Facebook groups that Minor monitors. The topic model then allocates comments to the respective volunteer at Minor. As a result, volunteers can spend more time on actually helping their clients on Facebook instead of sifting through unstructured comments to find the relevant cases. |
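One plausible piece of such a pipeline, sketched with the stm package; a data frame `comments` with a `text` column is hypothetical:

```r
library(stm)

processed <- textProcessor(comments$text, metadata = comments)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

fit <- stm(documents = prepped$documents, vocab = prepped$vocab,
           K = 10, data = prepped$meta, init.type = "Spectral")

labelTopics(fit)   # top words per topic, used to route comments to the right volunteer
```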
10:55 |
|
Text mining |
Nolwenn Le Meur
|
queryMed: Linking pharmacological and medical knowledge using semantic Web technologies |
Riva Quiroga |
Cassiopée |
|
Care trajectory analysis from medical administrative information systems data requires integrating multiple medical data sources (hospital care, drug consumption...) stored in various specialized codifications (ICD10, CIP...), more or less detailed and not always intelligible at first glance to epidemiologist researchers. Tools and methods are needed to facilitate the annotation and integration of such codes in order to decipher and compare care trajectories. Because medical data are often codified according to international nomenclatures, they can be linked to knowledge representations from the medical and pharmacological domains. Linked Data initiatives and Semantic Web technologies have led to the spread of knowledge representations through ontologies, thesauri, taxonomies and nomenclatures. However, these standards, technologies and knowledge representations remain difficult for non-computer scientists, such as pharmaco-epidemiologist researchers, to organize and query. The R package queryMed (https://github.com/yannrivault/queryMed) provides user-friendly methods to access biomedical knowledge resources from the Linked Data. It proposes methods for the enrichment of health care data to facilitate care trajectory analysis (e.g., pharmaco-surveillance, pharmaco-vigilance), making the most of ontology and nomenclature properties. |
10:25 |
|
Spatial & time series |
Enrico Spinielli, Tamara Pejovic
|
R in the Air |
Chris Prener |
Caravelle 2 |
|
Aircraft trajectories are becoming increasingly available, both publicly via ADS-B data crowd-sourced by aviation enthusiasts and non-profit organisations such as the OpenSky Network and ADSBexchange, and commercially via platforms such as FlightRadar24 and Flightaware. The trrrj R package supports the import, export and analysis of aircraft trajectories (4D: 3D plus time) in the various phases of a commercial flight. The package also provides support for spatial analysis and for plotting horizontal and vertical profiles. We present a real use case applying trrrj to arrival flights at several European airports, assessing the inefficiencies (time, distance flown, fuel/CO2 emissions) related to holdings. |
10:30 |
|
Spatial & time series |
Piotr Wójcik
|
Measuring inequalities from space. Analysis of satellite raster images with R |
Chris Prener |
Caravelle 2 |
|
Data on night-time light intensity is increasingly used by social science researchers as a proxy for economic development. Calculated from weather satellite recordings, it provides annual data for the whole globe in gridded format, with pixels covering less than one square kilometer. This allows researchers to aggregate these data at the level of subnational units and analyze them together with other socio-economic indicators. Satellite images are freely available as large raster files. The aim of this presentation is to show how to analyze these data step by step in R – starting from importing the data into R, then correctly overlaying a map of a selected area (e.g. a shapefile) on the raster object, limiting the satellite image to the selected spatial extent, and finally aggregating the data for the analyzed territorial units and visualizing the result. In addition, the correlation between night-time light intensity and selected socio-economic indicators (e.g. population, GDP) will be analyzed for world countries, US states and EU regions (NUTS2 and NUTS3). R packages: dplyr, sf, raster, ggplot2, leaflet, WDI, eurostat |
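The step-by-step workflow described above looks roughly like this; the file paths are hypothetical, and depending on package versions the sf object may need converting with as(x, "Spatial") before mask() and extract():

```r
library(raster)
library(sf)

lights  <- raster("nightlights_2013.tif")   # 1. import the satellite raster
regions <- st_read("nuts2_regions.shp")     # 2. read the administrative boundaries

lights_crop <- crop(lights, regions)        # 3. limit the raster to the spatial extent
lights_crop <- mask(lights_crop, as(regions, "Spatial"))

regions$mean_light <- extract(lights_crop, as(regions, "Spatial"),  # 4. aggregate per unit
                              fun = mean, na.rm = TRUE)

plot(regions["mean_light"])                 # 5. visualise the result
```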
10:35 |
|
Spatial & time series |
Florence Carpentier
|
SILand: an R package for estimating the spatial influence of landscape |
Chris Prener |
Caravelle 2 |
|
In ecology and in epidemiology, understanding how landscape structures the spatial distributions of species and populations is crucial to determine management rules for sustainable systems in agriculture and conservation biology. However, identifying landscape effects remains difficult, especially since no easy-to-use tools are available for this type of analysis. Here we present 'SILand', an R package. It is the first user-friendly tool that permits a complete and complex analysis of landscape effects, using a few functions based on a "classic" syntax similar to that of well-known packages (such as stats and lme4). It can be used to: (i) quickly import GIS files into R; (ii) infer the spatial influence of landscape variables and the area of significant influence of each landscape variable; and (iii) produce highly informative visualizations of the results (prediction maps). |
10:40 |
|
Spatial & time series |
Rodelyn Jaksons
|
Spatio-temporal Analysis of Diabrotica Emergence |
Chris Prener |
Caravelle 2 |
|
Diabrotica, more commonly known as the cucumber beetle or corn rootworm, is a beetle that is a major pest, known to cause major economic damage to corn growers. Diabrotica was previously only found in the United States and some parts of South America, but is believed to have been introduced to Europe during the Yugoslav wars. Due to its potentially devastating economic repercussions, it is important to understand the population life cycle of the beetles to ensure adequate controls and measures are in place to minimise impact. Through the use of the Gompertz curve and Bayesian hierarchical models in R, we can model the observed dynamics of the beetles' emergence to infer when emergence starts and how space, time, and climatic factors affect the dynamics. |
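For orientation, the Gompertz curve at the core of the model can be sketched with a simple non-linear least-squares fit (the Bayesian hierarchical version in the talk is considerably richer); a data frame `emergence` with columns `day` and `count` is hypothetical:

```r
# SSgompertz() parameterises the curve as Asym * exp(-b2 * b3^day)
fit <- nls(count ~ SSgompertz(day, Asym, b2, b3), data = emergence)
summary(fit)

plot(count ~ day, data = emergence,
     xlab = "Day of year", ylab = "Cumulative emergence")
lines(emergence$day, fitted(fit), col = "red")
```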
10:45 |
|
Spatial & time series |
Annette Scheffer
|
Navigating spatial data management and analysis in Sustainable Fisheries using a combined R-Python approach |
Chris Prener |
Caravelle 2 |
|
Geospatial data is of increasing importance in resource management, impact assessment and decision-making. However, the variety of requirements and levels of user experience when working with spatial data - from developing data visualisations through graphical interfaces to undertaking analyses with customized coding - necessitates tailored solutions to maximize the accessibility of data and analysis tools for different user groups. The Marine Stewardship Council (MSC) is a non-profit organization providing an internationally recognized sustainable seafood ecolabel and fishery and traceability certification programs. The nature of such an organization means that multiple departments have different requirements and coding proficiency, requiring that the MSC's spatial data management and analysis strategy accommodate a variety of access and analysis tools. Here, we discuss the implementation of spatial data storage, organisation and analysis at the MSC in light of maximizing accessibility, workflow efficiency and reproducibility of results. Our implementation combines open-source analysis tools such as R, Python and QGIS together with PostgreSQL for spatial data management. R and Python scripts combined via the rPython package allow connecting to the spatial database from R or QGIS and automating repeated processes such as specific queries and analyses. |
10:50 |
|
Spatial & time series |
Kim Antunez
|
Dealing with the change of administrative divisions over time |
Chris Prener |
Caravelle 2 |
|
The administrative divisions of countries change over time, making it tricky to combine territorial databases from different dates. I will present two packages which help to solve this problem:
1) COGugaison [1], which provides functions for converting a spatial dataset from one year to what it would be (or would have been) in another year. The package handles cases where territories merge or split.
2) CARTElette [2], which contains geographical layers corresponding to the annual division of French territories and which can be loaded directly into R's popular {sf} format. Thanks to these two packages, it is for example possible to look at the evolution of the number of women in each French department since 1975, taking into account that some of the territories have changed during this period [3]. At this time, these packages only concern France and are therefore only documented in French. By presenting this problem and my current work internationally, I hope to inspire future extensions to other countries and collaborations with international spatial analysts.
[1] https://github.com/antuki/COGugaison
[2] https://github.com/antuki/CARTElette
[3] https://antuki.github.io/slides/180306_RLadies_COGugaison_carto/180306_RLadies_COGugaison_carto.html#51 |
10:55 |
|
Spatial & time series |
Gregor De Cillia
|
persephone, seasonal adjustment with an object-oriented wrapper for RJDemetra |
Chris Prener |
Caravelle 2 |
|
The R package persephone is being developed to enable easy processing during the production of seasonally adjusted estimates. It builds on top of RJDemetra and provides analytical tools, such as interactive plots, to support the seasonal adjustment expert. The package should make it easy to construct personalized dashboards containing selected plots and diagnostics. Furthermore, it will support hierarchical time series and tackle the issue of direct vs. indirect adjustment. |
10:25 |
|
Open science, education & community |
Saras Windecker
|
Open-access software for research: beyond data analysis |
Shelmith Kariuki |
Saint-Exupéry |
|
Amidst growing recognition of irreproducibility in the scientific literature, we face concerns about the credibility of research and the reliability of the decisions it informs. Open science practices, such as writing open-access software, encourage research that is both transparent and repeatable. Although open-source R packages for data analysis are increasingly common, proprietary software and black-box approaches are still the norm for many data collection and processing procedures. In this talk I will briefly present "mixchar", an R package alternative to financially restrictive "point-and-click" methods to estimate carbon in plant litter, and discuss software development for open science in research more generally. |
10:30 |
|
Open science, education & community |
Angela Li
|
Teaching reproducible spatial analysis in R |
Shelmith Kariuki |
Saint-Exupéry |
|
In this talk we will discuss a workshop taught at the Center for Spatial Data Science at the University of Chicago to social scientists and econometricians with little to no background in spatial data or programming. Unlike a conventional spatial statistics or analysis course, the workshop integrated learning to code in R with learning to think spatially. Researchers learned how to explore, manipulate, and visualize spatial data using recently developed spatial packages in R, while at the same time learning habits for project management and reproducible research. This talk discusses pitfalls and success stories from the workshop, along with considerations when putting together a spatial data curriculum in R. It describes how the workshop led to increased programming literacy among researchers, as well as contributions to open-source spatial packages. Finally, it puts forth suggestions for a spatial data curriculum that can be shared openly to teach spatial thinking in R worldwide. |
10:35 |
|
Open science, education & community |
William Chase
|
Use aRt to learn algorithms, math, and R |
Shelmith Kariuki |
Saint-Exupéry |
|
I'm bad at math; algorithms look like a foreign language to me; and until recently, I thought of R as a tool just for statistics. All of that changed when I discovered generative art. Almost overnight I went from being afraid of math to dreaming of the Mandelbrot set and reading papers on Wang tiling algorithms. I desperately wanted to make my own art, but I had just become comfortable with R, and the idea of learning Processing or Javascript was daunting. So I barrelled forward with R, transforming my attitude towards coding and what is possible with R. In this talk I will take useRs along my journey from math-phobe to algorithm evangelizer through my "12 Months of aRt" project (which they can read about on my blog williamrchase.com). I will discuss how the beauty of generative art engaged me more than any math class, and I will inspire useRs to do something fun with R and learn in the process. Along the way, we will learn how R is a fully capable creative coding environment, and how you can leverage a wide breadth of tools such as the tidyverse, spatial libraries, and even Rcpp to turn your creative visions into reality. |
10:40 |
|
Open science, education & community |
Beatriz Milz
|
The evolution and importance of the R-Ladies São Paulo chapter in Brazil |
Shelmith Kariuki |
Saint-Exupéry |
|
R-Ladies is a worldwide organization that promotes gender diversity in the R community. Brazil is a developing country in Latin America that currently has 9 R-Ladies chapters; the São Paulo chapter was created in August 2018. By February 2019, it had almost 300 members, showing impressive growth in such a short period of time. The chapter has already held 7 events (four 3-hour meetups, one datathon and 2 whole-day workshops), and other meetings are being planned. All of the content presented is made available on GitHub, so anyone can access it. The group provides a safe environment for people interested in R, and all of the activities are free to attend. The fact that the activities are free is very relevant in the Brazilian context, since the few R courses available in Portuguese are often expensive, and there is no other active R group in São Paulo. The chapter also collaborates with other projects through lectures and workshops, which are also always open to the community. The increasing popularity of the group shows us how important it is to support it, which can be key to motivating the creation of other chapters in Brazil and increasing the strength of the Brazilian R community. |
10:45 |
|
Open science, education & community |
Binod Jung Bogati
|
Building Active Community at Your Place |
Shelmith Kariuki |
Saint-Exupéry |
|
The strength of R comes from its community. It's easy to get involved as a member of a community (if one exists). However, building a new community, or making a community active, is not easy. Here, I'll share my experience (as a student) of building the first R community in Nepal (now 350+ members). What were the shared struggles of my community? How did I manage to overcome challenges and build a welcoming and active community? Besides that, I'll also share how our community helps students learn R for academics and research. So, come and join me for tips on building an active and engaging community at your place. |
10:50 |
|
Open science, education & community |
Eyitayo Alimi
|
Scaling useR Communities with Engagement and Retention Models |
Shelmith Kariuki |
Saint-Exupéry |
|
In this talk, I intend to unfold the challenges local useR communities are facing, backing up my research with the following data: How many R user groups and R-Ladies chapters have been founded? How many are still in existence? How many couldn't survive in Africa, and how many were revived? These data will be analysed to proffer a lasting solution to these challenges and help nurture existing useR communities as well as new ones. All this will be capped with advice drawn from working with more than 7 growing communities over the last 5 years. |
10:25 |
|
Biostatistics & epidemiology |
Romane Poinsot
|
A Shiny Webapp for nutritional reformulation of food products according to French front-of-pack “Nutri-Score” label. |
Toby Dylan Hocking |
Ariane 1+2 |
|
In France, over half of the processed food consumed on a daily basis is made by the food industry. Dietary behaviors play a key role in the development of chronic diseases. In order to help people make healthier choices, the French government, among other actions, implemented the Nutri-Score label on a voluntary basis. This score, adapted from the Food Standards Agency score, ranges from A (green, best score) to E (red, worst score), taking into account the content of 7 favorable and unfavorable components commonly found in food. We developed a web-based Shiny app to help agribusiness professionals calculate and improve the Nutri-Score of their products. A first feature consists of loading the user's product nutritional composition table and automatically calculating the score for all products. Using ggplot2 and ggiraph, the user visualizes the score distribution across her/his product database, filtering according to her/his own variables. The officer package allows the user to pick the desired graphs for an automatic report. Then, for a product and a target score (from A to E) selected by the user, all combinations of components (expressed as content ranges) complying with the target are generated. Only the combinations closest to the initial nutritional composition are displayed, and the user can also choose which components may vary. In this way, the user gets conceivable solutions to improve the nutritional quality of her/his products while saving valuable time. |
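A sketch of the interactive score-distribution idea with ggplot2 and ggiraph (the data are illustrative, not the app's own code):

```r
library(ggplot2)
library(ggiraph)

products <- data.frame(
  name  = paste("Product", 1:6),
  score = c("A", "B", "B", "C", "D", "E")
)

p <- ggplot(products, aes(x = score, tooltip = name, data_id = name)) +
  geom_bar_interactive(fill = "seagreen")

girafe(ggobj = p)   # interactive htmlwidget, embeddable in the Shiny app
```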
10:30 |
|
Biostatistics & epidemiology |
Fiona Grimm
|
Using Shiny to track winter pressures in the UK National Health Service (NHS) |
Toby Dylan Hocking |
Ariane 1+2 |
|
The NHS in England is under considerable pressure during winter. Within the context of existing funding pressures, demand for hospital care increasingly exceeds the capacity of emergency departments. In recent years, performance targets have consistently been missed at a national level, with potentially worrying consequences for care quality and safety. This trend has also received growing media and political attention. Throughout winter the NHS regularly releases provider-level data on performance indicators, such as A&E waiting times and hospital bed occupancy, which are key to understanding quality of care and to informing future planning efforts. However, partly due to inconvenient formatting of the spreadsheets, it takes considerable analytical skill and effort to routinely produce aggregate metrics, examine trends and assess regional variation. We have developed a Shiny app as an interface for the visualisation and comparison of a range of NHS performance indicators over winter, aimed at the public, the media and NHS analysts (to be released before useR). The app also shows historical context and offers the option to aggregate indicators within local areas. With this we aim to provide a convenient, consistent and consolidated way of tracking NHS winter performance indicators. We also want to use it as a case study and learning resource to promote the use of R within the NHS via the NHS-R community. |
10:35 |
|
Biostatistics & epidemiology |
Thomas Petzoldt
|
antibioticR: An R package to identify resistant populations in environmental bacteria |
Toby Dylan Hocking |
Ariane 1+2 |
|
Antibacterial agents have made modern medicine possible. However, the dramatic increase of resistant and multiresistant bacteria is now recognized as a global challenge for human health. Phenotypic resistance can be measured in growth experiments where bacterial isolates are cultivated under drug exposure. This can be done in liquid media on multiwell plates to identify minimum inhibitory concentrations (MIC), or as diffusion test on an agar dish, where the diameter of the inhibition zone (ZD) is recorded. This is repeated with a large number of strains, because environmental populations are composed of different geno- and phenotypes. The MIC or ZD values form multi-modal distribution mixtures.
Package antibioticR (https://github.com/tpetzoldt/antibioticR) implements methods to separate sub-populations from environmental samples and to estimate distribution parameters and quantiles. It provides:
1. Kernel density smoothing to estimate location parameters and an initial guess of variance,
2. A web-based (shiny) implementation of the ECOFFinder algorithm (Turnidge et al, 2006),
3. Maximum likelihood estimation of multi-modal normal and exponential-normal mixtures.
The package analyzes sensitivity, tolerance and resistance at a sub-acute level to compare populations of different origin. It also contains visualization tools and interactive web applications. |
10:40 |
|
Biostatistics & epidemiology |
Daniela Mariosa
|
MR studies in R: how to use genetic information for identifying modifiable risk factors |
Toby Dylan Hocking |
Ariane 1+2 |
|
Mendelian randomization (MR) is a powerful approach to study causality by using germline genetic variants associated with a risk factor (exposure) as instrumental variables for the risk factor itself. The growing availability of results from large genome-wide association studies, not only for clinical outcomes but also for lifestyle exposures, makes MR analyses based on summary genetic data relevant for many exposure-outcome relationships. Properly conducting MR studies in R requires several steps that involve packages for both estimation and data visualization. We will present how to best exploit R to perform and present two-sample MR analyses, with an example from our work on the role of obesity-related factors in cancer risk. The approach includes the harmonization of the genetic information for the exposure and outcome, the estimation of the causal effect of the exposure on the outcome using different estimators, and a number of complementary analyses for the evaluation of potential pleiotropy, heterogeneity, assumption violations, and bias. The wide range of techniques required to conduct a robust MR analysis is reflected in the use of both widely used and MR-specific packages. |
10:45 |
|
Biostatistics & epidemiology |
Volha Tryputsen
|
Streamlining complex analyses of in-vivo data with INVIVOLDA shiny application |
Toby Dylan Hocking |
Ariane 1+2 |
|
In vivo studies are crucial to the development of novel therapies. In vivo data is important for proof-of-concept validation, FDA applications and clinical trials. Appropriate data analysis and interpretation are essential in providing knowledge about a drug's efficacy and safety within a living organism. With drug discovery science moving forward at an ever-accelerating rate, analysis software is not always capable of offering an appropriate analysis suite. In vivo scientists at Janssen R&D needed a comprehensive analysis tool to conduct appropriate and efficient analyses of in vivo data to ensure the quality and speed of decision-making. The INVIVOLDA Shiny application was developed to fill this gap. INVIVOLDA offers powerful linear mixed-effects modeling for evaluating differences between treatments over time; the Buckley-James method for inference about mean treatment differences when the response is non-linear in the presence of censoring; and survival analysis for the evaluation of time-to-event data. Furthermore, interactive and animated graphics allow users to conduct independent and thorough data explorations. INVIVOLDA streamlines complex statistical analyses of in vivo longitudinal data by utilizing modern graphics, appropriate modelling techniques and report generation, and ensures efficient, traceable and reproducible in vivo research and data-driven decision-making. |
10:50 |
|
Biostatistics & epidemiology |
Aritz Adin
|
A shiny web application for disease mapping. Making easy the fit of spatio-temporal models. |
Toby Dylan Hocking |
Ariane 1+2 |
|
Spatial and spatio-temporal analyses of count data are crucial in epidemiology and other fields to provide accurate estimates of mortality and/or incidence risks and to unveil the underlying spatial and spatio-temporal patterns. However, fitting spatial and spatio-temporal models is not easy for non-expert users. Here, we present the interactive web application SSTCDapp for the analysis of spatial and spatio-temporal mortality (or incidence) data, which is available at https://emi-sstcdapp.unavarra.es/. The web application is designed to perform descriptive analyses in space and time of mortality risks or rates, and to fit an extensive range of fairly complex spatio-temporal models commonly used in disease mapping. The application is built with the R package shiny and relies on the well-founded integrated nested Laplace approximation (INLA) technique for model fitting and inference. Unlike other software used in disease mapping, SSTCDapp provides a user-friendly interface that facilitates the fitting of complex statistical models by non-expert users without the need to install any software on their own computers, since all the analyses and computations are made on a powerful remote server. In addition, a desktop version is also available to run the application locally if needed, which avoids uploading the data to the online application, fully guaranteeing data confidentiality. |
11:30 |
|
Data mining |
Eric Lecoutre
|
Machine Learning with R: do it with a framework |
Sigrid Keydana |
Cassiopée |
|
There is no doubt that R is the Swiss Army knife for modeling activities. The variety of model families and implementations in numerous packages speaks for itself. Interested R users will read the Machine Learning task view and discover (some of) those packages. This talk is not about the models themselves but about frameworks. Some meta-packages indeed do not contain modeling functions at all but act as wrappers around existing assets and provide high-level functionality (cross-validation, hyperparameter grid search, stacking...). The objective is to have a consistent syntax with a limited set of 'verbs', as well as a way to implement a modeling process. We will introduce the reasons why such packages are so interesting for data scientists and present 3 solutions: caret, mlr and SuperLearner. In addition, as a modeling pipeline also requires data preparation, we will also talk about vtreat, recipes (+embed), mlrCPO, sl3 and their possible integration with those frameworks. We will present some modeling flows as examples and also introduce a tidy approach to modeling. |
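A small taste of the "framework" idea with caret, where one verb wraps resampling, tuning and fitting for many model types (the random forest choice is illustrative and needs the randomForest package installed):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fit <- train(Species ~ ., data = iris,
             method     = "rf",                   # swap for "glmnet", "xgbTree", ...
             trControl  = ctrl,
             tuneLength = 3)                      # small hyperparameter grid search
fit
```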
11:48 |
|
Data mining |
Erin LeDell
|
Building and Benchmarking Automatic Machine Learning Systems |
Sigrid Keydana |
Cassiopée |
|
This talk will provide a brief overview of the field of Automatic Machine Learning (AutoML), with a focus on software and benchmarking. The term "AutoML" refers to automated methods for model selection and/or hyperparameter optimization and includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering. AutoML tools are designed to maximize ease-of-use by simplifying the API. We will discuss the common AutoML software design patterns and take a detailed look at the AutoML algorithm inside of the "h2o" R package.
An important part of the development process to evolve and improve an AutoML system is a comprehensive benchmarking strategy. Benchmarking AutoML software across a wide variety of datasets allows the algorithm designer to identify weaknesses in the algorithm and software. This enables tool designers to make incremental, measurable improvements to their system over time. We will present a new open source platform for benchmarking AutoML systems which is part of the OpenML.org ecosystem for reproducible research in machine learning. The system is extensible, so anyone can write a wrapper for their software in order to benchmark it against the most popular open source AutoML systems. We will also present benchmarking results for H2O AutoML against a variety of (mostly Python-based) AutoML systems. |
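A minimal sketch of the H2O AutoML interface discussed in the talk (the data set and time budget are illustrative):

```r
library(h2o)
h2o.init()

train <- as.h2o(iris)

aml <- h2o.automl(y = "Species", training_frame = train,
                  max_runtime_secs = 60)   # or cap the run with max_models

aml@leaderboard   # all candidate models, ranked by cross-validated performance
aml@leader        # the best model, usable with h2o.predict()
```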
12:06 |
|
Data mining |
Michel Lang
|
mlr3: A new modular framework for machine learning with R |
Sigrid Keydana |
Cassiopée |
|
The package mlr (Machine Learning with R) was released to CRAN in 2013, and its core design dates back even further. The new mlr3 package is its modular, from-scratch reimplementation in R6. Data is stored primarily as data.tables. mlr3 relies heavily on the reference semantics of R6 and data.table, which enable efficient and elegant programming on the provided machine learning building blocks. The package is geared towards scalability and larger datasets by natively supporting parallelization and out-of-memory data-backends like databases. With a clear object-oriented design, mlr3 focuses on core computational operations, while add-on packages allow seamless integration of extended functionality. For example, mlr3survival implements tasks and learners for survival analysis, and mlr3pipelines extends mlr3 with graphs of (pre)-processing operations, which can be jointly tuned with the mlr3tuning package. Project page: https://mlr3.mlr-org.com |
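A minimal sketch of the mlr3 building blocks, using the sugar functions from the package documentation (task and learner choices are illustrative):

```r
library(mlr3)

task    <- tsk("iris")                       # a predefined classification task
learner <- lrn("classif.rpart", cp = 0.01)   # a learner with one hyperparameter set

rr <- resample(task, learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.acc"))             # aggregated accuracy across the folds
```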
12:24 |
|
Data mining |
Bernd Bischl
|
mlr3pipelines: Machine Learning Pipelines as Graphs |
Sigrid Keydana |
Cassiopée |
|
mlr3pipelines is an object-oriented dataflow programming toolkit for machine learning in R6. It provides an expressive and intuitive language to define ML workflows as directed acyclic graphs that represent data flows between computational units, e.g., preprocessing, model fitting and model combination. This chains data and model manipulation steps in a modular way to form powerful data processing pipelines. Many complex ML concepts, for which special purpose packages are usually provided, can now be expressed in few lines of graph definition code: e.g., unions of feature views, bagging, stacking and hurdle models. Resulting pipelines are parameterized, so all components can jointly be tuned to obtain an optimal configuration. Graphs can contain "branching" nodes which allow selective, conditional processing of execution paths. The tuning of such tasks allows complex model selection. The modular, object-oriented concept of mlr3pipelines facilitates convenient extension with custom operations, while the compatibility with mlr3 allows convenient tuning, benchmarking, nested resampling and more. Project page: https://github.com/mlr-org/mlr3pipelines |
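A minimal sketch of such a graph, chaining preprocessing operators to a learner with the %>>% operator and then treating the whole pipeline as a single learner (operator names follow the mlr3pipelines documentation):

```r
library(mlr3)
library(mlr3pipelines)

graph <- po("scale") %>>%
  po("encode") %>>%
  po("learner", lrn("classif.rpart"))

glrn <- GraphLearner$new(graph)              # the pipeline now behaves like one learner
resample(tsk("iris"), glrn, rsmp("holdout"))
```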
11:30 |
|
Programming 1 |
Scott Chamberlain
|
HTTP Requests For R Users and Package Developers |
Ildiko Czeller |
Saint-Exupéry |
|
Many R users request data from the web in their scripts and packages. This talk introduces a modern suite of packages for managing HTTP requests. The crul package is a modern HTTP request library, including asynchronous requests, automatic handling of pagination, and more. Importantly, crul provides an R6-based object system that makes it easier to program with relative to other tools. The webmockr package mocks HTTP requests, returning user-specified mocked responses matching the format of the real thing. The vcr package leverages webmockr to cache HTTP requests and responses. Both webmockr and vcr support many HTTP libraries. Last, httpcode provides information on all HTTP codes, and fauxpas provides proper HTTP error classes for use in most HTTP R libraries. Together these tools provide a modern way for R programmers to manage HTTP requests. |
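A minimal sketch of the R6 interface in crul, plus stubbing the same request in tests with webmockr (the URL is illustrative):

```r
library(crul)

con <- HttpClient$new(url = "https://httpbin.org")
res <- con$get("get", query = list(foo = "bar"))
res$status_code
res$parse("UTF-8")

# in unit tests, webmockr can intercept the same call:
library(webmockr)
webmockr::enable()
stub_request("get", "https://httpbin.org/get?foo=bar")
```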
11:48 |
|
Programming 1 |
Colin Gillespie
|
R and security |
Ildiko Czeller |
Saint-Exupéry |
|
Data science using R is increasingly performed in the cloud or over a network. But how secure is this process? In this talk, we won't look at complex hacking but instead focus on the relatively easy hacks that can be performed to access systems. We'll use three R-related examples of how it is possible to access a user's system. In the first example, we'll investigate domain squatting on the Bioconductor website. By registering only thirteen domains, we had the potential to run arbitrary code on hundreds of users' systems. In the second example, we'll look at techniques for guessing passwords on RStudio Server instances. Lastly, we'll highlight how users can be a little too trusting when running R code from blogs. |
12:06 |
|
Programming 1 |
Jennifer Bryan
|
DRY out your workflow with the usethis package |
Ildiko Czeller |
Saint-Exupéry |
|
Usethis is one of the packages created in the recent "conscious uncoupling" of the devtools package. Devtools is an established package that facilitates various aspects of package development. Never fear: devtools is alive and well and remains the public face of this functionality, but it has recently been split into a handful of more focused packages, under the hood. Usethis now holds functionality related to package and project setup. I'll explain the "conscious uncoupling" of devtools and describe the current features of usethis specifically. The DRY concept -- don't repeat yourself -- is well accepted as a best practice for code and it's an equally effective way to approach your development workflow. The usethis package offers functions that enact key steps of the package development process in a programmatic and documented way. This is an attractive alternative to doing everything by hand or, more realistically, copying and modifying files from one of your other packages. Usethis helps with initial setup and also with the sequential addition of features, such as specific dependencies (e.g. Rcpp, the pipe, the tidy eval toolkit) or practices (e.g. version control, testing, continuous integration). |
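A few of the setup steps mentioned above, done programmatically (the package name and license holder are hypothetical):

```r
library(usethis)

create_package("~/rpkgs/mypkg")   # scaffold DESCRIPTION, R/, .Rbuildignore, ...
use_git()                         # initialise version control
use_mit_license("Jane Doe")       # add a license
use_testthat()                    # testing infrastructure
use_rcpp()                        # add the Rcpp dependency and src/ scaffolding
use_pipe()                        # re-export the magrittr pipe
```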
12:24 |
|
Programming 1 |
Lionel Henry
|
Reusing tidyverse code, the easy way |
Ildiko Czeller |
Saint-Exupéry |
|
In 2017 the tidyverse grammars were reimplemented on top of tidy evaluation, a metaprogramming framework from the rlang package. Tidy eval makes it possible to program flexibly and robustly with data masking functions from packages like dplyr or ggplot2. However, possible does not mean easy. Tidy eval has the major downside of requiring to learn new programming concepts and tools. In this talk, we'll focus on easier techniques to reuse code from tidyverse pipelines without such a steep learning curve: mapping columns, using fixed column names, passing dots, subsetting the `.data` pronoun, and interpolating expressions. These techniques will help you write functions around tidyverse pipelines and reduce code duplication in your scripts. |
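Two of the listed techniques, sketched with dplyr (data and column names are illustrative): subsetting the `.data` pronoun with a character column name, and passing the dots straight through to `group_by()`:

```r
library(dplyr)

mean_by <- function(df, measure, ...) {
  df %>%
    group_by(...) %>%                                       # pass the dots through
    summarise(mean = mean(.data[[measure]], na.rm = TRUE))  # character column name
}

mean_by(mtcars, "mpg", cyl, gear)
```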
12:42 |
|
Programming 1 |
Davis Vaughan
|
Simple Arrays |
Dirk Eddelbuettel |
Saint-Exupéry |
|
Within the tidyverse, the core structure that powers many packages is the tibble, a modern reimagining of the data frame. Unfortunately, with the large focus on data frames, the array has been left behind. The rray package is an attempt to change that. By borrowing ideas from tibble, rray hopes to create "simpler arrays" that are more predictable to use and program around. To accomplish this, rray provides the following new infrastructure: (i) an rray class, which never drops dimensions while subsetting and consistently retains dimension names where possible; (ii) broadcasting semantics, using the xtensor library - rray implements the wildly popular idea of broadcasting, originally found in the Python library numpy, to allow more intuitive and powerful operations between multiple rray objects, opening up a much more complete set of operations than is currently possible with base R; (iii) a consistent toolkit for common array manipulation tasks, such as computing sums and products along any axis. Each function retains dimensionality by default, making it easy to link operations together through broadcasting. Importantly, this toolkit works with base R arrays as well as with the new rray objects. https://davisvaughan.github.io/rray/ https://github.com/DavisVaughan/rray |
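A broadcasting sketch in the spirit of the package README; the function names `rray()` and `rray_sum()` are assumptions, so check the linked documentation for the current interface:

```r
library(rray)

x <- rray(1:6, dim = c(3, 2))
y <- rray(c(10, 20, 30), dim = c(3, 1))

x + y                   # the (3, 1) object is broadcast across both columns of x

rray_sum(x, axes = 1)   # column sums; dimensions are retained (a 1 x 2 result)
```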
11:30 |
|
Models 2 |
Ghislain Vieilledent, Jeanne Clément
|
Using Rcpp* packages for easy and fast Gibbs sampling MCMC from within R |
Marco Scutari |
Caravelle 2 |
|
Hierarchical Bayesian models are increasingly used in applied statistics. The parameters of such models can be estimated through Gibbs sampling Markov chain Monte Carlo using a variety of algorithms (conjugate priors, Metropolis-Hastings, Hamiltonian Monte Carlo). These algorithms approximate the parameter posterior distributions through iterative simulations and are computationally intensive. Using the C/C++ language to code such algorithms makes computations faster. In our presentation, we will show how the Rcpp* R packages (Rcpp, RcppGSL and RcppArmadillo) can be easily used to (i) call Gibbs samplers written in C++ from within R, (ii) reduce computation time through efficient random draws, and (iii) facilitate vector and matrix operations. We will illustrate this approach with the new jSDM R package for fitting joint species distribution models. |
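As a generic illustration of the approach (not the jSDM sampler itself), a small Gibbs sampler for a bivariate normal can be written in C++ and called from R with Rcpp::cppFunction():

```r
library(Rcpp)

cppFunction('
NumericMatrix gibbs_cpp(int n_iter, double rho) {
  NumericMatrix out(n_iter, 2);
  double x = 0.0, y = 0.0;
  for (int i = 0; i < n_iter; i++) {
    // draw each coordinate from its full conditional given the other
    x = R::rnorm(rho * y, sqrt(1 - rho * rho));
    y = R::rnorm(rho * x, sqrt(1 - rho * rho));
    out(i, 0) = x;
    out(i, 1) = y;
  }
  return out;
}')

draws <- gibbs_cpp(10000, rho = 0.8)
cor(draws)   # the sample correlation should be close to 0.8
```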
11:48 |
|
Models 2 |
Elias Krainski
|
A toolbox for fitting non-separable space-time log-Gaussian Cox models using R-INLA |
Marco Scutari |
Caravelle 2 |
|
Many processes have non-separable space-time dynamics (e.g. disease spread and species distribution) which should be accounted for during modeling. A non-separable stochastic partial differential equation (SPDE) approach can be used to consider the realistic space-time evolution of the process, in which the spatial and temporal autocorrelation in the latent field are linked (Krainski 2018). Observations of these processes are often measured as point-referenced locations in time, i.e. space-time point patterns. The log-Gaussian Cox process model is a popular class for modeling point patterns. However, it can be difficult to fit in practice because the likelihood depends on an integral over the spatial and temporal domains. Implementing these models in R-INLA is challenging because it involves several steps. We provide a step-by-step approach to constructing a space-time model in R-INLA using fox rabies as a case study. We discuss several useful updates to the R-INLA package (e.g. inlabru), including improvements to the integration methods in space and time and options to improve computational performance. We also discuss several practical considerations for users to bear in mind during model construction, including mesh generation, model implementation, model checking and extensions to the basic model. The goal of this work is to help users avoid common pitfalls when constructing and interpreting these models. |
12:06 |
|
Models 2 |
Wei Jiang
|
Adaptive Bayesian SLOPE -- High-dimensional Model Selection with Missing Values |
Marco Scutari |
Caravelle 2 |
|
Model selection with high-dimensional data has become an important issue over the last two decades. In the presence of missing data, only a few methods are available to select a model, and their performance is limited. We propose a novel approach -- Adaptive Bayesian SLOPE -- an extension of sorted $l_1$ regularization in a Bayesian framework, to perform parameter estimation and variable selection simultaneously in the high-dimensional setting. This methodology in particular aims at controlling the False Discovery Rate (FDR). Meanwhile, we tackle the problem of missing data with a stochastic approximation EM algorithm. The proposed methodology is further illustrated by comprehensive simulation studies, in terms of power, FDR and bias of estimation. |
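For reference, the sorted $l_1$ (SLOPE) penalty that the method extends is usually written as follows; this is the standard formulation from the SLOPE literature, not a formula taken from the talk itself:

```latex
\mathrm{pen}(\beta) \;=\; \sum_{j=1}^{p} \lambda_j \, |\beta|_{(j)},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,
```

where $|\beta|_{(1)} \ge \cdots \ge |\beta|_{(p)}$ are the absolute coefficients sorted in decreasing order, so the largest coefficients receive the largest penalty weights.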
12:24 |
|
Models 2 |
Raluca Gui
|
REndo: An R Package to Address Endogeneity Without External Instrumental Variables |
Marco Scutari |
Caravelle 2 |
|
Endogeneity becomes a challenge when aiming to uncover causal relationships in empirical research. The reasons are manifold, e.g. omitted variables, measurement error or simultaneity. These might lead to unwanted correlation between the independent variables and the error term of a statistical model. While external instrumental variable methods can be used to control for endogeneity, these approaches require additional information which is usually difficult to obtain.
Internal instrumental variable (IIV) methods address this issue by treating endogeneity without the need for additional variables, taking advantage of the structure of the data. Implementations of IIV are rare. Therefore, we propose the R package "REndo", which implements five instrument-free methods: the latent instrumental variables approach (Ebbes et al. 2005), the higher moments estimation (Lewbel 1997), the heteroskedastic error approach (Lewbel 2012), the joint estimation using copulas (Park and Gupta 2012) and the multilevel GMM (Kim and Frees 2007).
This talk will focus on both the theory behind each of the five methods and their practical implementation using real and simulated data. |
12:42 |
|
Models 2 |
Anne Helby Petersen
|
Discovering the cause: Tools for structure learning in R |
Marco Scutari |
Caravelle 2 |
|
Enormous amounts of observational data are being produced every day by internet users, health care providers and satellites alike. This opens up a lot of new possibilities for what observational data may be used for. But if the subject is causality, it is still common to rely solely on externally proposed, "hypothesis-driven" models, which limits the range of causal inquiries. However, in some cases it is possible to construct a causal model from the data using structure learning. This is not only relevant for those interested in inferring causality. Even when prediction is the goal, knowledge of causal structures is useful for helping domain adaptation, because the mechanistic nature of causal structures makes them more stable. A myriad of packages for structure learning have been developed in R, including pcalg, bnstruct, bnlearn, deal, catnet and stablespec, each of them dedicated to a certain class of causal models (e.g. linear), a specific algorithmic approach (e.g. constraint-based), or a combination of both. In this presentation, I provide an overview of existing R packages for structure learning, focusing on overlaps and differences in functionality, interface and possibilities to include external information. I also discuss how the packages may be integrated into a joint tool, thereby facilitating structure learning without settling on a model class or learning approach a priori. |
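A small taste of structure learning with one of the packages mentioned, bnlearn, using its score-based hill-climbing algorithm (the bundled synthetic data set is used purely for illustration):

```r
library(bnlearn)

data(learning.test)          # synthetic data shipped with bnlearn
dag <- hc(learning.test)     # learn a DAG by score-based hill climbing
dag
arcs(dag)                    # the learned directed edges
```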
11:30 |
|
Forecasting |
Mitchell O'Hara-Wild
|
Flexible futures for fable functionality |
Genaro Sucarrat |
Ariane 1+2 |
|
The fable ecosystem provides a tidy interface for time series modelling and forecasting, leveraging the data structure from the tsibble package to support a more natural analysis of modern time series. fable is designed to forecast collections of related (possibly multivariate) time series, and to provide tools for working with multiple models. It emphasises density forecasting, whilst continuing to provide a simple user interface for point forecasting. Existing implementations of time series models work well in isolation; however, it has long been known that ensembles of forecasts improve forecast accuracy. Hybrid forecasting (separately forecasting components of a time series) is another useful forecasting method. Both ensemble and hybrid forecasts can be expressed as forecast combinations. Recent enhancements to the fable framework now provide a flexible approach to easily combine and evaluate the forecasts from multiple models. The fable package is designed for extensibility, allowing for easier creation of new forecasting models and tools. Without any further implementation, extension models can leverage essential functionality including plotting, accuracy evaluation, model combination and diagnostics. This talk will feature recent developments to the fable framework for combining forecasts, and the performance gain will be evaluated using a set of related time series. |
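A minimal sketch of such a forecast combination, assuming the fable/fabletools interface where arithmetic on the model columns of a mable defines a combination model (data from tsibbledata):

    library(fable)
    library(tsibble)
    library(tsibbledata)
    library(dplyr)

    aus_production %>%
      model(ets = ETS(Beer), arima = ARIMA(Beer)) %>%
      mutate(ensemble = (ets + arima) / 2) %>%   # forecast combination of the two models
      forecast(h = "2 years") %>%
      autoplot(aus_production)                   # density forecasts for all three models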
11:48 |
|
Forecasting |
Thiyanga Talagala
|
Feature-based Time Series Forecasting |
Genaro Sucarrat |
Ariane 1+2 |
|
This work presents two feature-based forecasting algorithms for large-scale time series forecasting. The algorithms involve computing a range of features of the time series, which are then used to select the forecasting model. The forecasting model selection process is carried out using a pre-trained classifier. In our first algorithm, we use a random forest as the classifier. We call this framework FFORMS (Feature-based FORecast Model Selection). The second algorithm uses an efficient Bayesian multivariate surface regression approach to estimate the forecast error of each method, and then selects the forecasting model with the minimum predicted error. Both algorithms have been evaluated using time series from the M4 competition, and are shown to yield accurate forecasts comparable to several benchmarks and other commonly used automated approaches in the time series forecasting literature. The methods are made available in the seer and fformpp packages in R. |
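The seer and fformpp interfaces are not reproduced here; purely as a generic illustration of the feature-based idea (with made-up model labels), one could compute features with the tsfeatures package and feed them to an off-the-shelf classifier:

    library(tsfeatures)
    library(randomForest)

    series <- list(AirPassengers, USAccDeaths, UKgas, ldeaths)
    feats  <- as.data.frame(tsfeatures(series))   # one row of features per series

    # hypothetical labels: which model forecast each training series best
    best <- factor(c("ets", "arima", "ets"))

    rf <- randomForest(x = feats[1:3, ], y = best)
    predict(rf, feats[4, , drop = FALSE])         # select a model for a new series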
12:06 |
|
Forecasting |
Benjamin Goehry Hui Yan
|
Random forests for time series |
Genaro Sucarrat |
Ariane 1+2 |
|
Random forests were introduced in 2001 by Breiman and have since become a popular learning algorithm, for both regression and classification. However, when dealing with time series, random forests do not take the time-dependent structure into account and treat each instant as an independent observation. In this study, we propose rangerts, an extended version of the ranger package for time series.
Goehry (2018) proved, under suitable hypotheses on the parameters and the time series, that random forests are consistent. In practice, the idea is to replace the IID bootstrap with a dependent bootstrap to subsample the time series during the tree construction phase, thereby taking time dependency into account.
We tested our package on both numerical simulations and real-world applications to electricity load, and show that our method improves the forests' accuracy in some cases. We also discuss how to choose the key parameters. References: Breiman L. Random forests, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. Goehry B. Random forests for time-dependent processes, preprint, available: https://hal.archives-ouvertes.fr/hal-01955331, 2018. Wright M.N. & Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R, J Stat Softw 77:1-17, 2017. |
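A baseline fit with the CRAN ranger package on lagged data is sketched below (AirPassengers stands in for an electricity load series); rangerts, on GitHub, extends this interface with dependent/block bootstrap options whose argument names are not reproduced here:

    library(ranger)

    y <- as.numeric(AirPassengers)
    dat <- data.frame(
      load  = y[13:length(y)],
      lag1  = y[12:(length(y) - 1)],           # lagged values as predictors
      lag12 = y[1:(length(y) - 12)],
      month = rep(1:12, length.out = length(y) - 12)
    )

    fit <- ranger(load ~ ., data = dat, num.trees = 500)
    fit$prediction.error                       # out-of-bag mean squared error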
12:24 |
|
Forecasting |
Ivan Svetunkov
|
Smooth forecasting in R |
Genaro Sucarrat |
Ariane 1+2 |
|
There are several packages in R that implement forecasting using state space models, but only one that relies on the single source of error state space model (the "forecast" package). Unfortunately, the forecasting functions in that package are not flexible enough for different research purposes. For example, exponential smoothing, implemented in the ets() function, does not allow using explanatory variables, does not allow setting the initial values of the state vector, and does not allow fitting models to data with periodicity higher than 24. This motivated the original development of the smooth package back in 2015, with the main aim of making research in the forecasting area "smooth". Four years later, the smooth package has a handful of flexible forecasting functions useful for different research purposes, implementing ETS, ARIMA, vector exponential smoothing, simulation functions and more. In this presentation we will discuss the main functions of the package, show their advantages and disadvantages, and show how they can be applied to real-world forecasting problems, complementing "forecast" and other widely used packages. |
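A minimal sketch of the central function, es(), with automatic ETS selection (see the package documentation for ARIMA via ssarima() and the other functions mentioned):

    library(smooth)

    # automatic ETS selection ("ZZZ"), 12-step-ahead forecast, with a holdout for accuracy
    fit <- es(AirPassengers, model = "ZZZ", h = 12, holdout = TRUE)
    fit             # model summary, information criteria and accuracy on the holdout
    fit$forecast    # point forecasts for the next 12 observations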
12:42 |
|
Forecasting |
Eran Raviv
|
Forecast Combination in R |
Genaro Sucarrat |
Ariane 1+2 |
|
Introducing the R package ForecastComb. The aim is to provide researchers and practitioners with a comprehensive implementation of the most common ways in which forecasts can be combined. The package in its current version covers 15 popular estimation methods for creating combined forecasts, including simple methods, regression-based methods, and eigenvector-based methods. It also includes useful tools to deal with common challenges of forecast combination (e.g., missing values in component forecasts, or multicollinearity), and to rationalize and visualize the combination results. |
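A minimal sketch, assuming the package's documented foreccomb() workflow (set up the combination task from an observed vector and a matrix of component forecasts, then apply a combination method); the simulated forecasts are purely illustrative:

    library(ForecastComb)

    set.seed(42)
    obs   <- rnorm(100)
    preds <- cbind(m1 = obs + rnorm(100, sd = 0.5),
                   m2 = obs + rnorm(100, sd = 1.0),
                   m3 = obs + rnorm(100, sd = 1.5))

    fc_task <- foreccomb(observed_vector = obs, prediction_matrix = preds)
    comb_SA(fc_task)    # simple average combination
    comb_OLS(fc_task)   # regression-based combination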
11:30 |
|
Communities & conferences |
Dennis Irorere
|
R for Data Science Online Community |
Hannah Frick |
Concorde 1+2 |
|
The R for Data Science (R4DS) Online Learning Community was started by Jesse Mostipak in August 2017, with the goal of creating a supportive and responsive online space for learners and mentors to come together to provide support and guidance in learning R in a beginner-friendly, safe and supportive environment. The main aim of the community was to move through the *R for Data Science* text by Garrett Grolemund and Hadley Wickham, which walks readers through the major features of the tidyverse in the context of data science. We also aim to help users learn R and expand their R knowledge. Over time, the R4DS community has been developing projects intended to help connect mentors and learners. One of the first projects born out of this collaboration is #TidyTuesday, a weekly social data project focused on using tidyverse packages to clean, wrangle, tidy, and plot a new dataset every Tuesday. Over the years, the community has experienced steep growth, with over 2,000 members from various countries in our Slack channel. At the R4DS community, we encourage diversity and contribution from everyone. We believe that no question is silly and no one is an island of knowledge. In this talk, we will share the positive changes made since the transfer of leadership, our challenges, the lessons we have learnt, and provide room for suggestions and opinions. |
11:48 |
|
Communities & conferences |
Laura Acion
|
Insights from the recent R community development and growth in Latin America |
Hannah Frick |
Concorde 1+2 |
|
This talk summarizes some of the recent work that put Latin America (LatAm) on the international R user map. In January 2017, R-Ladies Buenos Aires (BA) was founded. R-Ladies BA encouraged the foundation of other R-Ladies groups in the region (e.g., Santa Rosa, Santiago). In two years, LatAm went from one to 29 active R-Ladies groups. English-only materials can be a barrier to R. Hence, LatAm R-Ladies led efforts to translate into Spanish: R for Data Science (es.r4ds.hadley.nz), the R-Ladies code of conduct and policies (bit.ly/2UwP79x), and the R Consortium 2017 (bit.ly/2Tpt8RR) and RStudio 2018 surveys. The creation of new regional conferences was another R-Ladies-led effort. Thanks to the R-Ladies network, LatinR (latin-r.com) got started rapidly and LatAm had its first SatRDay. Among other achievements, these events boosted the regional community in the form of new and diverse R user groups (e.g., Rosario, Montevideo). In a nutshell, in the last two years, the LatAm R user community has seen fast growth and regrouping, mainly spearheaded by R-Ladies. This presentation details key aspects of this process and seeks to inspire other regional R communities worldwide. |
12:06 |
|
Communities & conferences |
Dennis Irorere Shel Kariuki
|
AfricaR |
Hannah Frick |
Concorde 1+2 |
|
AfricaR is a consortium of passionate African R user groups and users who innovate with R every day and are inspired to share their experience as well as communicate their findings to a global audience. The consortium was born out of the underrepresentation of Africans in every role and area of participation, whether as R developers, conference speakers, educators, users, researchers, leaders or package maintainers. As a community, our mission is to achieve improved representation by encouraging, inspiring, and empowering Africans of all genders who are underrepresented in the global R community. Our primary objective is to support existing R users and R enthusiasts across Africa in embracing the full potential of R programming, by fostering a collaborative continental network of R gurus, mentors, learners, developers and leaders to help facilitate individual and collective progress worldwide. The AfricaR talk includes a presentation of our work plan, collaborators, partners and mentors. We will also use this opportunity to present statistics on our members and the R user groups in our network, and to launch our website. #AfricaRusers - Twitter handle. |
12:24 |
|
Communities & conferences |
Noa Tamir Colin Gillespie Riva Quiroga Vincent Warmerdam
|
The truth about satRdays (panel session, part 1) |
Hannah Frick |
Concorde 1+2 |
|
Nine satRday events have been organised since useR! 2018. Join us for a panel discussion with a diverse group of organisers from across the world. You will learn about why they chose to volunteer their time, how they approached satRday's mission, what they have learnt from organising the event, how satRday impacted their local R community, and more! |
12:42 |
|
Communities & conferences |
Noa Tamir Colin Gillespie Riva Quiroga Vincent Warmerdam
|
The truth about satRdays (panel session, part 2) |
Hannah Frick |
Concorde 1+2 |
|
Nine satRday events have been organised since useR! 2018. Join us for a panel discussion with a diverse group of organisers from across the world. You will learn about why they chose to volunteer their time, how they approached satRday's mission, what they have learnt from organising the event, how satRday impacted their local R community, and more! |
11:30 |
|
Biostatistics & epidemiology 1 |
Thibaut Jombart
|
Reproducible data science to support outbreak responses: experience from the North Kivu Ebola outbreak |
Rich Fitzjohn |
Guillaumet 1+2 |
|
The response to emerging disease outbreaks and health emergencies is increasingly data-driven, integrating various sources of information to improve situational awareness in real time. Outbreak analytics face many of the modern data science challenges, plus additional difficulties pertaining to the emergency, low-resource settings characterising some of these outbreaks. In this presentation, I will outline some of these challenges, and a range of solutions developed by the R Epidemics Consortium (RECON), an NGO dedicated to developing free analytics resources for health crises. In particular, we will discuss different aspects relating to the deployment of robust, reliable reporting infrastructures in the 2019 Ebola outbreak in North Kivu, DRC. We will showcase features of new R packages dedicated to data standardisation and cleaning (linelist), automated reporting (reportfactory), and offline alternatives to online repositories such as CRAN and GitHub for deploying R-based analysis environments (RECON deployer). I will conclude on how R can help strengthen outbreak response capacities, and more generally humanitarian work, in low- and middle-income countries. |
11:48 |
|
Biostatistics & epidemiology 1 |
Zhian Kamvar
|
Advancing data analytics for field epidemiologists using R: the R4epis innovation project |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Data analysis is integral to informing operational elements of humanitarian medical responses. Field epidemiologists play a central role in informing such responses as they aim to rapidly collect, analyse and disseminate results to support Médecins Sans Frontières (MSF) and partners with timely and targeted intervention strategies. However, a lack of standardised analytical methods within MSF challenges this process. The R4epis project group consists of 18 professionals with expertise in R programming, field epidemiology, data science, health information systems, geographic information systems, and public health. Between October 2018 and April 2019, R scripts were developed to address all aspects of data cleaning, data analysis, and automatic reporting for outbreaks (measles, meningitis, cholera and acute jaundice) and surveys (retrospective mortality, malnutrition and vaccination coverage). Analyses and outputs were piloted and validated by epidemiologists using historical data. The resulting templates were made available to field epidemiologists for field testing, which was conducted between February and April 2019. R4epis will contribute to improving the quality, timeliness and consistency of data analyses, and to the standardisation of outputs from field epidemiologists during emergency response. |
12:06 |
|
Biostatistics & epidemiology 1 |
Vincent Audigier
|
micemd: a smart multiple imputation R package for missing multilevel data |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Multiple imputation is a common strategy to overcome the missing data issue. Several MI methods have been proposed in the literature to impute multilevel data with classical sporadically missing values only. However, methods for dealing with more complex missing data are needed. Indeed, the multilevel structure is often due to data merging (because of the heterogeneity between collected datasets), but variables often vary according to the dataset, leading to systematically missing variables. micemd is an add-on for the mice package to perform multiple imputation using chained equations with two-level data. It includes imputation methods specifically handling sporadically and systematically missing values (Resche-Rigon et al. 2013, Audigier, V. et al, 2018). micemd offers a complete solution for the analysis: the choice of the imputation model for each variable can be automatically tuned according to the data structure (Audigier, V. et al, 2018), it gathers tools for checking model fit (Blackwell, M. et al, 2015) and it allows parallel calculation. The talk is motivated by a meta-analysis in cardiovascular disease consisting of 28 observational cohorts in which both systematically and sporadically missing data occur (the GREAT data). Then, based on a simulation study, the advantages and drawbacks of each multiple imputation method are discussed. Finally, the methods are compared on the GREAT data. |
12:24 |
|
Biostatistics & epidemiology 1 |
Iryna Schlackow
|
Facilitating external use with user-friendly interfaces: a health policy model case study |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Health policy models are increasingly being used by clinicians, analysts and policy makers to predict patients' long-term outcomes, how these outcomes are affected by interventions, and whether the interventions are (cost-)effective and should be recommended for use. The usability of models, as well as the reliability and transparency of methods and results, is therefore vital. Even when the code is freely available, laws preventing the sharing of sensitive individual patient data mean that the published results are still not fully reproducible. Additionally, model users may want to change input parameters, and therefore need to possess the skills, and time, to understand the underlying code. We present a Shiny-based user-friendly web interface for a health policy model predicting the progression of chronic kidney disease and cardiovascular complications. The interface is freely available and users can change different parameters using drop-down menus or .csv files; the output is a detailed downloadable .csv file, and a user guide is provided together with a range of templates. We discuss how, in addition to aiding usability, such an interface may help with debugging and transparency, and what the key considerations during development are. |
12:42 |
|
Biostatistics & epidemiology 1 |
Torben Tvedebrink
|
genogeographer - a tool for ancestry informative markers |
Rich Fitzjohn |
Guillaumet 1+2 |
|
Ancestry informative markers (AIMs) are genetic markers that give information about the genogeographic ancestry of individuals. They are for example used to predict the genogeographic origin of individuals related to forensic crime and identification cases. A likelihood ratio test (LRT) approach is derived in order to prevent erroneous conclusions caused by, e.g., the absence of the relevant population from the database of reference populations. The LRT is a measure of absolute concordance between a profile and a population, rather than a relative measure of the profile's likelihood in two populations by a likelihood ratio (LR). The LRT is similar to Fisher's exact test and constitutes an outlier test analogous to studentized residuals. Varying sample sizes of the reference populations in the database are explicitly included in the LRT with fewer assumptions than the LR. The LRT is extended to handle admixed profiles (parents of different genogeographic origin). The methodology has been implemented in the R package genogeographer with an optional shiny front-end that enables forensic geneticists to perform explorative analyses and produce various graphical outputs together with evidential weight computations. |
14:00 |
|
Operations & data products |
Francois Michonneau
|
How a non-profit uses R for its daily operations |
Colin Gillespie |
Concorde 1+2 |
|
The Carpentries is a non-profit organization that organizes a global community of scientists and runs workshops where researchers are taught foundational computational and data science skills. Since 2012, The Carpentries has taught 2,000+ workshops in 46 countries, reaching 38,000+ learners. Here, I will present how The Carpentries uses R in its daily operations: from analyzing our survey results, to sending personalized certificates of workshop attendance to learners, to creating live-updating visualizations that help our workshop instructors understand their audience before the workshops, to our lesson templates. I will go through examples of how R has become a central part of how we set up our systems to manage and automate workflows within our organization. The combination of literate programming to generate reports, web application development with Shiny, and the availability of packages to interact with the web APIs of many of the online tools and services we use has allowed us to develop custom workflows. We can iterate quickly and deploy using continuous integration services and Docker. This talk will be of interest to organizations looking to automate their operations, and will demonstrate how R can successfully be used in production. |
14:18 |
|
Operations & data products |
Daan Seynaeve
|
rjenkins and rrundeck: Coordinating Continuous Integration and Delivery with R |
Colin Gillespie |
Concorde 1+2 |
|
Continuous integration is a software development practice that advocates for members of a team to merge their work frequently in a shared repository. Each integration is verified through a process of automated building and testing. This process can be facilitated through the use of a build server; a popular choice is Jenkins, a free and open source automation server. Continuous delivery is an extension of the continuous integration principle which asks that every successfully built version of the software is put into a staging environment, from where it can easily be deployed to a production setting. The deployment process is still manual, but it can be simplified through the use of self-service operations. Rundeck is an open source management platform that allows users to define such self-service operations. We propose two new R packages: rjenkins and rrundeck. These packages interact with the Web APIs offered by Jenkins and Rundeck to easily create, trigger and monitor jobs and operations. This provides R users with an intuitive interface to these tools which can be used in an interactive or scripted way. |
14:36 |
|
Operations & data products |
Kelly Obriant
|
Advanced Git Integrations for Automating the Delivery of Reproducible Data Products in R |
Colin Gillespie |
Concorde 1+2 |
|
We know that adopting git or other code version control mechanisms is important for creating a culture of reproducibility in data science. But once you've established the basic best practices in your daily workflow, what comes next? This talk will be an introduction to continuous integration and continuous delivery (CI/CD) tools. I'll cover reasons why CI/CD tools can enhance reproducibility for R and data science, showcase practical examples like automated testing and push-based application deployment, and point to simple resources for getting started with these tools in a git and GitHub based environment. The target user base for advanced Git and GitHub integration tooling remains focused on software engineers and IT professionals. As data scientists lead the scientific community as a whole toward better, reproducible research practices, we need to be aware of the vast ecosystem of technology solutions that can benefit this mission. The specific tools I'll aim to cover in this short presentation are GitHub webhooks, Jenkins, and Travis, all framed in terms of their use with R code and data products. |
14:54 |
|
Operations & data products |
Verena Held Max Held
|
GitHub actions for R |
Colin Gillespie |
Concorde 1+2 |
|
Continuous integration and delivery (CI/CD) has evolved as a software development best practice, and it also strengthens reproducibility in (data) science. GitHub actions is a new workflow automation feature of the popular code repository host GitHub. It is a convenient service layer on top of the popular container standard Docker, and is itself partly open source, thus limiting vendor lock-in. GitHub actions may offer better CI/CD for the R community, but most importantly, it is simple to reason about if things go wrong. The ghactions project presented here offers three avenues to bring GitHub actions to the R community:
1. Developing and curating actions to run R-specific jobs on GitHub, including arbitrary R code or deploying to shinyapps.io.
2. Furnishing users with some out-of-the-box workflows for different kinds of R projects.
3. Documenting experiences and evolving best practices for how to make the most of GitHub actions for R. More information on the ghactions package and project can be found at: http://maxheld.de/ghactions/. |
14:00 |
|
Programming 2 |
Tomas Kalibera
|
Sustainable Package Development |
Dirk Eddelbuettel |
Cassiopée |
|
Writing and maintaining packages is an essential contribution to the R community. Despite a number of formal requirements on packages, most of the internal details of the language can be inspected and modified through reflective features and a rich C API, giving a lot of freedom to package developers. This probably contributed to the popularity of R, but poses a risk when not used responsibly.
R needs to adapt to the changing needs of its users and to changes in the software/hardware environments in which it is used. As of today, almost any change in the R runtime, however minute, breaks some packages. The causes are hard to find especially when the change is to undocumented R behavior or when it "wakes up" an old bug. Tests using all CRAN/BIOC packages are run routinely, requiring expensive hardware. Debugging requires skill, experience, knowledge of R internals and typically much more time than the implementation of the change that caused the bug. It is done by R Core and adds to the workload of repository maintainers.
This talk is an inspiration for package authors who want to develop packages responsibly, without unnecessarily increasing the cost of maintenance and evolution of R. It will include advice and examples based on my work on the R runtime (and debugging of packages) related to PROTECT bugs, in-place modification of immutable values, memory management at the C/C++ boundary, and parse data fixes. |
14:18 |
|
Programming 2 |
Elie Canonici Merle
|
Typing R |
Dirk Eddelbuettel |
Cassiopée |
|
For a long time now, programming languages have been divided into two categories: dynamically typed ones and statically typed ones. Both sides tend to argue that their system has more inherent benefits than drawbacks according to their needs. On the one hand, it is convenient to have a system that allows writing programs that are not well typed but that you know to be correct. On the other hand, no programmer is ever safe from a silly mistake taking ages to be fixed, if ever detected, in their code base. Thus, with the ever growing usage of dynamically typed languages such as R or JavaScript, it has become increasingly important to detect mistakes as early as possible in the development process. By adapting approaches inherited from the strongly statically typed languages community, we have developed a typing system for a fragment of the R programming language. We argue that it does not restrict the expressiveness of the R language beyond what is actually widely used. Moreover, we have embedded a type checker in a state-of-the-art integrated development environment, leveraging the graphical interface to report useful errors to the user. |
14:36 |
|
Programming 2 |
Perry De Valpine
|
nCompiler: C++ code-generation from R code |
Dirk Eddelbuettel |
Cassiopée |
|
Many package developers boost performance by coding key steps in C++, using R's C headers and/or Rcpp. The nimble package, predecessor to nCompiler, includes a system for automatic generation of C++ for a core subset of R's math and distribution functions. nimble implements vectorized and recycling-rule operations by code-generating to the Eigen C++ library and automatic differentiation via the CppAD library (in development versions). It includes basic flow control and static data types. However, as a general R programming tool, nimble has design limitations. nCompiler is a new package, designed to be a more general programming tool, with some refactored components from nimble. nCompiler allows definition of classes that mix R and C++ (code-generated or embedded) data and methods as well as pure functions. Much numerical work becomes C++ without coding any C++ by hand. nCompiler plays well with Rcpp and harnesses its compilation tools; harnesses Eigen more deeply, including its Tensor features; supports automatic differentiation via CppAD; and is designed for embedding code in packages, parallelizing, serializing C++ objects, and providing a natural workflow. |
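nCompiler's own interface is still evolving; as a sketch of the same R-to-C++ code-generation idea using the predecessor nimble package's nimbleFunction()/compileNimble():

    library(nimble)

    # a pure R function with declared types, compiled to C++ by nimble
    sumsq <- nimbleFunction(
      run = function(x = double(1)) {
        returnType(double(0))
        return(sum(x^2))
      }
    )

    csumsq <- compileNimble(sumsq)   # generates, compiles and loads the C++ code
    csumsq(rnorm(5))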
14:54 |
|
Programming 2 |
Zbynek Slajchrt
|
Mixed interactive debugging of R and native code with FastR and Visual Studio Code |
Dirk Eddelbuettel |
Cassiopée |
|
Interactive debuggers are one of the most useful tools to aid software development. The two most widely used approaches for interactive debugging in the R ecosystem, the built-in debugger and RStudio, do not support interactive debugging of both the R and C code of R packages at the same time and in one tool. FastR is an open source alternative R implementation that, apart from being compatible with GNU R, puts emphasis on the performance of R execution and on tooling support. FastR is part of the multilingual virtual machine (GraalVM) that, among other things, provides language-agnostic support for interactive debugging. One of the other projects built on top of GraalVM is a C/C++ interpreter. FastR can be configured to run the native code of selected R packages using this C/C++ interpreter, which should yield the same behavior, but since both languages are now running in one system, it opens up many exciting possibilities, including seamless cross-language debugging. In the talk, we will demonstrate how to configure Visual Studio Code and FastR for cross-language interactive debugging and how to debug a sample R package with native code. |
14:00 |
|
Numerical methods |
Alessandro Gasparini
|
Analysing results from Monte Carlo simulation studies using the rsimsum package and the INTEREST shiny app |
Thomas Petzoldt |
Caravelle 2 |
|
Monte Carlo simulation studies are computer experiments that involve generating data by pseudo-random sampling; they provide an invaluable tool for statistical and biostatistical research. Consequently, dissemination of results plays a focal role in driving adoption and further development of new methods. However, simulation studies are often poorly designed, analysed, and reported. One of the aspects that is often poorly reported - often not reported at all - is the Monte Carlo error of summary statistics, defined as the standard deviation of the estimated quantity over replications. Monte Carlo errors play a crucial role in understanding how results are affected by chance.
In order to aid researchers interested in running and analysing simulation studies, we developed rsimsum and INTEREST. rsimsum is an R package for analysing results from simulation studies: it computes the most common summary statistics, and Monte Carlo errors are reported by default. INTEREST is a shiny app providing an interface to rsimsum, offering tools to analyse simulation studies interactively and to export plots and tables of summary statistics for later use. rsimsum and INTEREST can aid in investigating results from simulation studies and supplement the reporting of their results to a great extent, allowing researchers to share the full results of their simulation study and readers to explore them freely and in a more engaging way. |
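A minimal sketch using the MIsim example data bundled with rsimsum, as in the package documentation:

    library(rsimsum)

    data("MIsim", package = "rsimsum")
    s <- simsum(data = MIsim, estvarname = "b", true = 0.50,
                se = "se", methodvar = "method")
    summary(s)   # bias, coverage, etc., with Monte Carlo errors reported by default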
14:18 |
|
Numerical methods |
Robert Crouchley
|
Algorithmic Differentiation in R using the RcppEigenAD package |
Thomas Petzoldt |
Caravelle 2 |
|
Algorithmic Differentiation (AD) has been available as a programming tool for statisticians for a number of decades. However, its adoption as an alternative to symbolic and numeric methods does not seem to be very common. One possible reason for this is the difficulty that is typically encountered when attempting to integrate AD functionality into existing statistical computing environments. The RcppEigenAD package attempts to mitigate these difficulties when employing AD within R by combining the facilities of the Rcpp package for extending R with C++, the AD library CppAD, and the Eigen linear algebra library, into a single R package. The resulting package, RcppEigenAD, allows the user to define matrix-valued functions of matrix arguments in C++ and seamlessly integrate them into an R session in a way that allows computing not only the function itself but also its Jacobian and Hessian. The package also includes an implementation of Faa di Bruno's formula for calculating the partial derivatives of the composition of functions defined by the user. The package has applications in the areas of optimisation, sensitivity analysis and calculating covariances via the delta method, which are illustrated with examples. |
14:36 |
|
Numerical methods |
Richard Fitzjohn
|
Describing and solving differential equations with a new domain specific language, odin |
Thomas Petzoldt |
Caravelle 2 |
|
Solving differential equations in R presents a challenge because one must choose between implementations that are either expressive but slow to compute, or fast but cumbersome to write. I present a new package, "odin", that removes this tradeoff by creating a domain specific language (DSL) hosted in R that compiles a subset of R to C for efficiently expressing and solving differential equations (JavaScript and R are compilation targets under current development). By treating the set of differential equations as a directed acyclic graph, the DSL is declarative rather than imperative. "odin" leverages the well established "deSolve" package to interface with a number of well understood solvers. I present applications of "odin" both in a research context, for implementing epidemiological models with tens of thousands of equations, and in teaching contexts, where we are using the introspection built into the DSL to automatically generate shiny interfaces. |
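A minimal sketch of the DSL (logistic growth, compiled to C, with parameter defaults via user()):

    library(odin)

    logistic <- odin::odin({
      deriv(N)   <- r * N * (1 - N / K)   # the differential equation
      initial(N) <- N0                    # initial condition
      r  <- user(0.5)                     # user-settable parameters with defaults
      K  <- user(100)
      N0 <- user(1)
    })

    mod <- logistic$new()
    mod$run(seq(0, 25, by = 1))           # solved via deSolve under the hood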
14:00 |
|
Spatial data & maps |
Emmanuel Blondel
|
Strengthening of R in support of spatial data infrastructures management: geometa and ows4R packages |
Thibault Laurent |
Ariane 1+2 |
|
The amount of data to manage is increasing across institutions. Metadata plays a key role in making this data findable, accessible, interoperable and re-usable, and has become a pillar through legal frameworks (e.g. INSPIRE) and the emergence of data management plans (DMPs). Data managers thus have to deal with these requirements by applying standards to manage (meta)data formats and access protocols. It is especially the case for spatial information, which is ruled by ISO/OGC standards. The use of R has been spreading worldwide as a preferred tool for data managers and scientists. In this context, some projects were initiated to support metadata handling for specific domains (e.g. EML), while the capacity to produce standardized ISO/OGC geographic metadata with R was limited. The geometa and ows4R packages aim to fill this gap by providing functions to write and read ISO/OGC metadata, and interfaces to the OGC Web Services. We explain how they work, including recent features provided with the support of the R Consortium. We then present how they contribute to several national and international information systems in different domains such as fisheries, marine monitoring, ecology and earth observation. Based on these packages, the geoflow initiative, an orchestrator for spatial data management in R, will be introduced to demonstrate how R can be used for managing spatial data infrastructures. |
14:18 |
|
Spatial data & maps |
Ege Rubak
|
Resample-smoothing of Voronoi intensity estimators |
Thibault Laurent |
Ariane 1+2 |
|
Voronoi estimators are non-parametric and adaptive estimators of the intensity of a point process. The intensity estimate at a given location is equal to the reciprocal of the size of the Voronoi/Dirichlet cell containing that location. Their major drawback is that they tend to paradoxically under-smooth the data in regions where the point density of the observed point pattern is high, and over-smooth where the point density is low. To remedy this behaviour, we propose to apply an additional smoothing operation to the Voronoi estimator, based on resampling the point pattern by independent random thinning. Through a simulation study we show that our resample-smoothing technique improves the estimation substantially. The proposed intensity estimation scheme is also applied to two datasets: locations of pine saplings (a planar point pattern) and motor vehicle traffic accidents (a linear network point pattern). Everything is implemented in R and released in the `spatstat` package available on CRAN. The oral presentation will explain the basic concepts, which are very simple, and leave out the mathematical details. Instead, the focus is on the relevant objects and classes in the R implementation and how, e.g., the Voronoi/Dirichlet tessellation is handled on a linear network such as a road map. |
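A minimal sketch with spatstat, assuming the densityVoronoi() arguments f (the retention probability of the independent thinning) and nrep (the number of resamples):

    library(spatstat)

    X <- swedishpines                                  # planar point pattern shipped with spatstat

    plain    <- densityVoronoi(X, f = 1)               # raw Voronoi intensity estimate
    smoothed <- densityVoronoi(X, f = 0.1, nrep = 200) # resample-smoothed estimate

    plot(smoothed, main = "Resample-smoothed Voronoi intensity")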
14:36 |
|
Spatial data & maps |
Timothée Giraud
|
Thematic mapping with "cartography" |
Thibault Laurent |
Ariane 1+2 |
|
The R spatial ecosystem is blooming and dealing with spatial objects and spatial computations has never been so easy. In this context, the aim of the cartography package is to create thematic maps with the visual quality of those designed with classical mapping or GIS tools. The package helps to design cartographic representations such as proportional symbols, choropleth, typology, flows or discontinuities maps. It also offers several features that improve the graphic presentation of maps, for instance map palettes, layout elements (scale, north arrow, title...), labels or legends. cartography is a mature package (first release in 2015) that has already been reviewed in both software- and cartography-focused journals (Giraud, Lambert 2016 & Giraud, Lambert 2017). It follows current good practices by using continuous integration and a test suite. A vignette, a cheat sheet and a companion website help new users to start using the package. In this presentation we will first give an overview of the package's main features. Then we will develop examples of use of the package along with other spatial-related packages; a short code sketch follows the references below.
Giraud, T., & Lambert, N. (2016). cartography: Create and Integrate Maps in your R Workflow. The Journal of Open Source Software, 1(4), 1-2.
Giraud, T., & Lambert, N. (2017, July). Reproducible cartography. In International Cartographic Conference (pp. 173-183). Springer, Cham. |
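A minimal sketch following the package's documented examples (the mtq sample dataset ships with cartography; variable and argument names are taken from those examples and should be treated as indicative):

    library(sf)
    library(cartography)

    mtq <- st_read(system.file("gpkg/mtq.gpkg", package = "cartography"))

    # choropleth map of median income with layout elements
    choroLayer(x = mtq, var = "MED", legend.title.txt = "Median income (euros)")
    layoutLayer(title = "Median income in Martinique", scale = 5, north = TRUE)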
14:54 |
|
Spatial data & maps |
Edwin De Jonge
|
Creating privacy protecting density maps: sdcSpatial |
Thibault Laurent |
Ariane 1+2 |
|
R allows for creating beautiful maps and many examples can be found. Cartography is an indispensable tool in analyzing spatial data and communicating regional patterns to your target audience. Current data sources often contain location data, making maps a natural visualisation choice. While these detailed location data are fine for analytic purposes, derived statistics carry the risk of disclosing information about individual persons. For example, if one plots the spatial distribution of income or of social welfare receipt, sparsely populated areas may be very revealing. Statistical procedures to control disclosure have been readily available (sdcTable, sdcMicro), but those focus on protecting tabulated data or microdata, not on protecting spatial data as such. The R package sdcSpatial (https://github.com/edwindj/sdcSpatial) allows for creating maps that show spatial patterns but at the same time protect the privacy of the target population. The package also contains methods for assessing the associated disclosure risk for a given data set. |
14:00 |
|
Visualisation |
Achim Zeileis
|
colorspace: A Toolbox for Manipulating and Assessing Color Palettes |
William Chase |
Saint-Exupéry |
|
The R package "colorspace" (http://colorspace.R-Forge.R-project.org/) provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (Hue-Chroma-Luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space.
Namely, general strategies for three types of palettes are provided: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
To aid selection and application of these palettes the package provides scales for use with ggplot2; shiny (and tcltk) apps for interactive exploration (see also http://hclwizard.org/); visualizations of palette properties; accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies. |
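A minimal sketch of the toolbox (palette construction, preview, and color-vision-deficiency emulation):

    library(colorspace)

    hcl_palettes(plot = TRUE)                       # overview of the pre-defined HCL palettes

    pal_q <- qualitative_hcl(4, palette = "Dark 3") # qualitative palette
    pal_s <- sequential_hcl(7, palette = "Blues 3") # sequential palette

    demoplot(pal_s, type = "heatmap")               # preview on a demo display
    deutan(pal_q)                                   # emulate deuteranopia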
14:18 |
|
Visualisation |
Ian Lyttle
|
Vegawidget: Composing and Rendering Interactive Vega(-Lite) Charts |
William Chase |
Saint-Exupéry |
|
Vega-Lite, alongside Vega, is a JavaScript implementation of an interactive grammar of graphics, developed by the Interactive Data Lab at the University of Washington. You build chart specifications using JSON; Vega(-Lite) renders your specifications as charts in your browser. The vegawidget package (on CRAN) is an htmlwidgets interface to Vega(-Lite), letting you compose specifications using R lists. The package offers functions to help build chart specifications and to render them as htmlwidgets. It also offers functions to define interactivity between Shiny and Vega(-Lite) via datasets, events, and signals (reactive variables). You can also define interactivity using JavaScript in an R Markdown document. Although Vega-Lite offers an interactive grammar of graphics, this package offers a low-level interface for composing chart specifications. As a result, vegawidget is designed to be extensible, making it easier to develop higher-level, user-friendly packages to build specific types of charts, or even to build a general ggplot2-like framework, using vegawidget as a foundation. Package website: https://vegawidget.github.io/vegawidget |
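A minimal sketch, close to the package's introductory example: a Vega-Lite specification built as an R list and rendered as an htmlwidget.

    library(vegawidget)

    spec <- as_vegaspec(list(
      `$schema` = vega_schema(),        # Vega-Lite schema
      data = list(values = mtcars),
      mark = "point",
      encoding = list(
        x = list(field = "wt",  type = "quantitative"),
        y = list(field = "mpg", type = "quantitative")
      )
    ))

    vegawidget(spec)   # render the chart in the viewer / browser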
14:36 |
|
Visualisation |
Ursula Laa
|
Visualising high-dimensional data: new developments of the tourr package using Shiny and plotly |
William Chase |
Saint-Exupéry |
|
The tour is a tool for the visualisation of multi-dimensional structures by means of dynamic projections, implemented in the R package tourr (Wickham et al., 2011). Availability in R means that we can readily extend it with features from other packages. In this talk I will show how we can use Shiny and plotly to create a graphical interface, enhancing usability for non-experts and allowing for interactive features like linked brushing in the tour display, stopping and restarting with new settings, or hover-text information on data points. In addition, I will show how index functions scoring the "interestingness" of 2D projections, available in various R packages, can be combined with the guided tour, steering projections towards more interesting views of the data. |
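A minimal sketch with the existing tourr interface (the flea data ship with the package); the Shiny/plotly front end discussed in the talk wraps this functionality:

    library(tourr)

    animate_xy(flea[, 1:6])                                    # grand tour, 2D projections
    animate_xy(flea[, 1:6], tour_path = guided_tour(holes()))  # guided by the holes index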
14:54 |
|
Visualisation |
Jim Harner
|
xstatR: an Environment for Running R and XLISP-STAT in Docker Containers |
William Chase |
Saint-Exupéry |
|
R is the lingua franca of statistical computing, but it lacks a strong API for interactive, dynamic graphics. Shiny and ggvis are excellent, but they do not support dynamic actions (brushing, geodesic rotations, etc.). XLISP-STAT and R were the dominant computing and graphics platforms in the nineties and early 2000s, but it became clear that only a single open source platform was viable and the community chose R. However, it is now common to use multiple platforms, e.g., R and Spark or R and Python. xstatR is an environment that combines R and XLISP-STAT within a single Docker container (https://github.com/jharner/xstatR). R and XLISP-STAT run as separate Linux processes in a container with a bridge between them, which allows saved R objects to be translated into Lisp objects (or vice versa). Models typically are built in R, and exploratory and diagnostic dynamic plots are created in XLISP-STAT. Since XLISP-STAT also supports windows, menus, etc., a graphical interface has been developed which obviates the need for users to learn Lisp. Prototype XLISP-STAT packages have been built for model-based interactive diagnostic plots, multivariate visualizations, and GGobi-type dynamic graphs. The xstatR container can be deployed locally or to any cloud service, e.g., AWS. We are refactoring xstatR to run XLISP-STAT as a subprocess of R, which will allow XLISP-STAT to more directly access R objects. |
14:00 |
|
Bioinformatics 2 |
Michael Lawrence
|
Interfacing R/Bioconductor with Hail, a Spark-based platform for genomics |
Sina Rueger |
Guillaumet 1+2 |
|
Hail is a Spark-based framework for genomic computing at scale. We have explored the application of deferred evaluation to the construction of an interface between R and Hail. The interface implements standard base R and Bioconductor APIs on top of Hail by constructing expressions in Hail's interface language and evaluating them using sparklyr. Users require no special knowledge of Hail or Spark. We will describe the design of the interface and demonstrate the manipulation of a Hail-backed SummarizedExperiment object, the core abstraction for genomic data in Bioconductor. |
14:18 |
|
Bioinformatics 2 |
Federico Marini
|
iSEE: interactive and reproducible exploration and visualization of genomics data |
Sina Rueger |
Guillaumet 1+2 |
|
Data exploration is crucial in the comprehension of large biological datasets generated by high-throughput assays such as sequencing, with interactivity as a key aspect in generating insightful outputs. Most existing tools for intuitive and interactive visualization are limited to specific assays or analyses, and lack support for reproducible analysis. Sparked by a Bioconductor community-driven effort, we have built a general-purpose tool, iSEE - Interactive SummarizedExperiment Explorer, designed for interactive exploration of any experimental data which can be stored in a SummarizedExperiment object, i.e. an integrative data container for storing matrices of assays and tables of associated metadata. iSEE (https://bioconductor.org/packages/iSEE/) is implemented in R and Shiny, and is compatible with many existing R/Bioconductor packages for high-throughput biological data (a short usage sketch follows the feature list below). Essential features include: - A highly customizable interface with different panel types, simultaneously viewing and linking panels to each other
- Automatic tracking of the exact R code generating all visible plots for full reproducibility
- Interactive tours to showcase datasets and findings
- Extendable analyses with custom panel types
- Seamless deployment as an online companion browser for collaborations and publications. |
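A minimal sketch: iSEE() takes any SummarizedExperiment-derived object and returns a Shiny app (here a small simulated SingleCellExperiment; in practice one would pass a real dataset):

    library(iSEE)
    library(SingleCellExperiment)

    counts <- matrix(rpois(2000, lambda = 5), nrow = 100,
                     dimnames = list(paste0("gene", 1:100), paste0("cell", 1:20)))
    sce <- SingleCellExperiment(
      assays  = list(counts = counts),
      colData = DataFrame(group = rep(c("A", "B"), each = 10))
    )

    app <- iSEE(sce)        # build the Shiny app
    # shiny::runApp(app)    # launch the interactive explorer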
14:36 |
|
Bioinformatics 2 |
Pol Castellano-Escuder
|
POMA: Shiny tool for targeted metabolomic data statistical analysis and visualization |
Sina Rueger |
Guillaumet 1+2 |
|
Similarly to other high-throughput technologies, metabolomics usually faces a data mining challenge in providing an understandable and useful output to advance biomarker discovery and precision medicine. Biological interpretation of the results is one of the difficult steps, and several bioinformatics tools have emerged to simplify and improve it. However, sometimes these tools accept only very simplistic data structures and, for example, do not even accept data with several covariates. POMA is a free, friendly and fast Shiny interface for analysing and visualizing data from a targeted metabolomics assay, hosted at https://polcastellano.shinyapps.io/POMA/. POMA allows the user to go from the raw data to the statistical analysis. The analysis is organized in three blocks: "Load Data" (where the user can upload a metabolite data file and a covariates file), "Pre-processing" (value imputation and normalization) and "Statistical analysis" (univariate and multivariate methods, limma, correlation analysis, feature selection methods, random forest, etc.). These steps include multiple types of interactive data visualization integrated in an intuitive user interface that requires no programming skills. Finally, POMA also generates different automatic statistical and exploratory reports to facilitate the analysis and interpretation of the results. |
16:00 |
|
Keynote |
Martin Morgan |
How Bioconductor advances science while contributing to the R language and community |
Christine Choirat |
Concorde 1+2 |
|
The Bioconductor project has had profound influence on the statistical analysis and comprehension of high-throughput genomic data, while contributing many innovations to the R language and community. Bioconductor started in 2002 and has grown to more than 1700 packages downloaded to ½ million unique IP addresses annually; Bioconductor has more than 30,000 citations in the scientific literature, and positively impacts many scientific careers. The desire for open, reproducible science contributes to many aspects of Bioconductor, including literate programming vignettes, multi-package workflows, teaching courses and online material, extended package checks, use of formal (S4) classes, reusable ‘infrastructure’ packages for robust and interoperable code, centralized version control and support, nightly cross-platform builds, and a distinctive release strategy that enables developer innovation while providing user stability. Contrasts between Bioconductor and R provide rich opportunities for reflection on establishing open source communities, how users translate software into science, and software development best practices. The ever-changing environment of scientific computing, especially the emergence of cloud-based computation and very large and heterogeneous public data resources, point to areas where Bioconductor, and R, will continue to innovate. |