Just found these notes while clearing out old files, from an abortive Quarto blog years ago. I ended up going down a Web development rabbit hole and got fed up with both Web development and Quarto.
New blog
I have never tried blogging before, though I have often considered it. I showed a few interesting things to a colleague recently, who asked me if I had a blog. So I have just spent half a day (of annual leave) setting up a blog with the R package blogdown (which uses Hugo), GitHub and Netlify. Kudos to https://github.com/ojroques who created and shared the theme I am using. I am using ForwardEmail to forward email from the email address below to my own personal email. So it has all cost me nothing except time (OK I have to pay for the domain name), which was part of the challenge.
This occasional blog is just for me to share things that might be useful to others, like training materials in draft, or where I have managed to help someone with a problem and learnt something in the process, or how to do things, or useful collated resources, and also for me to clarify my thinking in some areas by writing it down. It is not for self-promotion and I am not looking for another job, so I won’t be tweeting about it or otherwise advertising it - if you are reading this, you are a member of a very select set that I have shared this with. Check out the RSS feed below if you are interested in following the blog. I probably won’t post more than weekly. I follow a lot of blogs (too many) using a self-hosted instance of Miniflux which I read on my phone with FeedMe.
I am a well-rounded geek with a lot of interests, but I intend to restrict this blog to technical matters of possible interest to epidemiologists, analysts of infectious disease surveillance data, R users, Python users, data scientists, and anyone else whose primary job relates to data and analysis, perhaps particularly health data, but who is also interested in other related technologies, in particular open source software. Let’s see how it goes.
First define your terms
Why have I called this blog epidemiologydatascience.net? Data science wasn’t a term I used much prior to the COVID-19 pandemic. I had a vague idea that data scientists were just people who used both R and Python, and probably worked for Google or Facebook.
During the pandemic, those self-identifying as epidemiologists were often thrown together with those self-identifying as data scientists. I realised that some epidemiologists do data science of a sort, and that data scientists can do epidemiology. Some mathematical modellers or statisticians realised that they could call themselves epidemiologists (it was probably not entirely coincidental that this was a term that the general public had become increasingly familiar with) but some could have called themselves data scientists instead. And many noticed how their roles were overlapping with those of their IT and informatics colleagues.
So I think there is a rich multi/interdisciplinary area which could be called epidemiology data science (or epidemiological data science). I am certainly not the first to think this or to consider the implications for traditional skill sets and career pathways. How do we define epidemiology data science (at least for the purposes of this blog)?
Epidemiology is variously defined, e.g. as the (soft) basic science of public health practice, or the discipline of asking what/how many/where/when/who/why questions about the health of populations. Data science is similarly tricky to pin down, as the data science skill set varies according to the domain of application, but at its broadest it could be defined as a discipline that learns from data using any of a wide range of (typically quantitative) analytical methods, sometimes in combination with elements of informatics/data engineering, software development and/or use of information and communication technologies (i.e. computers and networks).
So epidemiology data science could be briefly defined as the interdisciplinary study of population health data using quantitative analytical methods and computing technologies. I could add e.g. “… to protect public health” but I don’t want to exclude those working in academia (for whom knowledge creation could be an end in itself), or in industry (for whom the company bottom line could be the motivator), or any evil epidemiology data scientists who actually want to harm public health.
So what does this mean for someone like me, who came to epidemiology from a health or health science background and could conceivably have learned very little about analytical methods or computing technologies during my career?
When I came into public health, a typical public health professional would be forced to learn some epidemiology and statistics at a couple of points in their early career. Afterwards they would usually forget any theory, which had only been hazily appreciated anyway, but retain enough useful rules of thumb such as “chi-squared test compares proportions” or “t test compares means” to be able to supervise junior staff doing basic analysis; more complicated stuff would definitely require a statistician. My typical public health professional might find learning more advanced methods painful. As a practically-focussed adult learner they might object to learning theory, however simplified, and might expect to be taught a practical and unequivocal rule-based approach similar to what they had retained for chi-squared tests etc (and ideally done by clicking a button), which would only be possible to a degree.
Fortunately, I think this caricature is increasingly untrue. Almost everyone seems to want to “learn R” these days (OK maybe not the Stata and Python users) - that’s an excellent way into epidemiology data science. The diversity of skills has grown in public health teams as a result of the pandemic. We have novel health data sources and more accessible analytical methods and computing technologies than you can shake a stick at, largely thanks to the open source philosophy. Whether we call it that or not, I think the future is bright for epidemiology data science.
In which I start to do something practical
I was going to speak my mind about the kind of skills that I think epidemiology data scientists should acquire, but I have been distracted by something more interesting and practical. It is related to something that I have worked on previously and now need to make some serious progress with.
The task in hand is: reproduce a surveillance report that is currently produced manually and formatted laboriously in Word. I have done a version in RMarkdown, with a Word template, but I haven’t been able to reproduce a few things like mixing text boxes with a multicolumn format. My version probably looks fine as it is, but I still think I can do better (meaning closer to the original).
There is only so much you can do with basic RMarkdown. But HTML^[HyperText Markup Language] (the markup language behind Web pages) is much more flexible. Markdown was originally conceived as a simplified way of writing Web pages, so if you understand some Markdown you can easily understand basic HTML. I would recommend that all epidemiology data scientists at least do the basic W3Schools HTML course.
HTML is even more flexible when combined with CSS^[Cascading Style Sheets] (which is a way of defining the styles, such as size, colour, font, etc, for each part of a Web page and/or for the page overall). You can do a lot with some understanding of CSS, so doing enough of the W3Schools course to get the general idea should be enough for most epidemiology data scientists. CSS does get very complex beyond a certain level.
The problem with creating a report as a Web page is that it may not look very good once printed, either to paper or PDF. You have little control over where pages begin and end. A few years ago I came across a solution to this called paged.js, a JavaScript library “… that paginates any HTML content to produce beautiful print-ready PDF”. JavaScript is the programming language of the Web, which runs in your browser to give your Web pages interactivity and other functionality. A library is a collection of code which allows a Web page to do something, and it is most often downloaded from the Web at the same time as the Web page using it. Pagination simply means breaking into pages (in a way that is under your control). At the time I didn’t know enough about JavaScript to know where to start with paged.js, though you could clearly do beautiful things with it.
I did play with a self-hosted instance of jsreport for a while, which is a very nice platform allowing you to create Web pages which are already paginated nicely for printing. It did allow me to create the more complex format that I was looking for. But in the end I realised it would not be easily integrated with my RMarkdown workflow.
Fortunately in recent years someone produced an R package which allows you to use paged.js: the pagedown package. This package (and related ones) provides templates for various documents such as reports and business cards. So if I can create my own custom pagedown template (using HTML and CSS) I can integrate this nicely into my workflow. If I can make the template generic enough and build it into an R package, it can be used for other similar reports by colleagues.
In my next post I will get the basics working: setting up a new R package, creating a very basic pagedown template in it and putting everything on GitHub. After that I can develop it further and demonstrate something about HTML, CSS, fonts, SVG and perhaps even JavaScript.
In which I become a Web developer
In my quest to develop a pagedown template, I have realised why authors of other pagedown templates tend to provide one or more templates as is, with limited flexibility on overall format. It is because R users (including myself) tend not to be Web developers, and developing a pagedown template (though not using the pagedown package per se) really tests your basic understanding of HTML and CSS. Some examples of this approach are:
If I don’t surpass this with my own template, I hope to expose the internals enough to make it modifiable without being an expert in HTML/CSS.
The next thing I realised is that you don’t need the pagedown package to use paged.js. It provides some convenience functions but you can manage just fine without it. I started off with an existing pagedown template, aiming to modify it for my own needs, as others have done, and ended up with something that worked but which I didn’t fully understand. So I became interested to see if I could start with nothing and create my own template from scratch, without using the pagedown package.
I’ve also realised that to explain what I am doing properly would take several blog posts. So in this post I will focus on how you can use HTML and CSS (with paged.js) to set up a basic Web page that can be printed to paginated PDF. Then I will develop it further with CSS to look like a report. Then I will show how to turn it into a template for RMarkdown. And finally I will make it into a package.
A minimal starting point for an HTML Web page is shown below. You can copy this code into a blank text file, call it something like simple_html_template.html and save it to a new folder on your Desktop. I explain what each part of the HTML means below.
Apart from the first line, which tells your browser that the text file it has just opened is a Web page, HTML is basically text and tags.
- The text is usually what appears on the Web page. We haven’t added any text to our example yet so it would just be a blank page if you opened the file in your browser.
- The tags (the bits surrounded by angle brackets, e.g. <body>) indicate the structure of the page: what is a heading, what is a paragraph, etc.
- Where tags need to surround something, they come in pairs, with an opening tag looking like <tag> and a closing tag looking like </tag>. It doesn’t matter if paired tags are on the same line as what they are enclosing.
- It is common practice to indent things between tags for readability, but that is optional.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title></title>
  </head>
  <body>
  </body>
</html>
Note that everything after the first line is “nested” within a pair of <html> tags. Within that there are two parts, called the head and the body, which are each demarcated by their own tags. Other things are nested within the head and body tags.
The head is where you put the metadata for the Web page. Metadata (basically meaning “information about information”) is information about the Web page, such as information that the browser needs to show the page correctly. Metadata is usually not of interest to the user and so is not shown in the browser.
One important piece of metadata for the Web page is what character encoding system is in use. Computers use numbers to represent (or, to put it more technically, encode) characters (such as letters, digits or punctuation) and there are different ways of doing that. When the computer sees a number, if you haven’t told it which system to use, it might show the wrong character. The UTF-8 system is widely used because it compactly represents over 100,000 characters, including those from many different languages, mathematical symbols etc. <meta charset="UTF-8"> tells the browser to use the UTF-8 system.
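To make the encoding point concrete, here is a small sketch of a page whose text goes beyond plain English letters. The sentence in the body is made up purely for illustration, and the sketch assumes that your text editor saves the file itself as UTF-8.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- Tells the browser to decode the bytes of this file as UTF-8 -->
    <meta charset="UTF-8">
    <title>Encoding check</title>
  </head>
  <body>
    <!-- Illustrative text only: accented letters and symbols outside basic ASCII -->
    <p>Café, naïve, 10 µg/m³, incidence ≥ 5 per 100,000</p>
  </body>
</html>
```

If the declared encoding and the encoding actually used to save the file do not match, the browser may show garbled text such as “CafÃ©” in place of “Café”.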
The other piece of metadata that is shown here is the page title. The page title is important: it defines what is shown in the browser tab at the top of your page, but it is also what is shown if someone finds your page via a search engine or bookmarks your site. We can make the page title “Report” if we amend the code to show:
<title> Report </title>
You may have noticed that the opening <html> tag contained lang="en". This is also metadata, telling your browser that your page contains English text.
The head section is also where you link your page to other resources such as files containing CSS formatting rules or JavaScript code. We will do this in a minute.
The body section is in general where we put the content that the user will see. To add one simple line of text to the page, we can use paragraph tags (<p>). Our HTML file now looks something like this:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title> Report </title>
  </head>
  <body>
    <p> Hello World!</p>
  </body>
</html>
Note that you can miss out certain tags (and let your browser guess which bits of your page are which) but it will cause you problems when you get to the CSS part.
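As a rough sketch of what missing out tags can look like: the HTML standard treats the <html>, <head> and <body> tags, and the closing </p> tag, as optional in a case like this, so the shortened file below should display the same as the full version above. The browser works out for itself where the head and body are.

```html
<!DOCTYPE html>
<!-- No html, head or body tags, and no closing p tag: the browser infers them all -->
<title>Report</title>
<p>Hello World!
```

The browser still creates the head and body behind the scenes, but because they are not written down it is arguably easier to lose track of which part of the page a CSS rule applies to, which is one reason to write the tags out in full.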
So now we have a valid HTML page. We can open it in our browser and admire our handiwork. The next step is to add pagination. At the moment we need JavaScript to do this, though one day it should be possible to do this using only CSS.
We use <script> tags to include JavaScript in a Web page. You can include actual JavaScript code between these tags, or just point to an external file containing JavaScript code. The external file can be on your own computer or somewhere else on the Internet. paged.js is a JavaScript “library” (collection of code) which adds pagination to your page. It is an example of a “polyfill”: a JavaScript library that lets browsers do things that they cannot yet do natively with standard HTML, CSS etc. Many such JavaScript libraries are made available by “content delivery networks”, or CDNs. CDNs are Web services which rapidly provide commonly used resources to Web pages, which makes the Internet faster. Adding paged.js to our Web page requires only one line of code:
<script src="https://unpkg.com/pagedjs/dist/paged.polyfill.js"></script>
You can put this anywhere between the <html> tags, but it is common to put it at the end of the <body> section (i.e. on the line just before </body>), because sometimes this makes your page appear more quickly to the user.
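For comparison, here is a minimal sketch of the two other ways of using <script> tags mentioned above: code written directly between the tags, and code loaded from a file on your own computer. The file name my_script.js and the id "message" are hypothetical, chosen only for this illustration, and paged.js is left out to keep the sketch small.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Script tag sketch</title>
  </head>
  <body>
    <p id="message">Waiting for JavaScript...</p>

    <!-- Option 1: JavaScript written directly between the script tags -->
    <script>
      // Replace the paragraph text once the browser reaches this code
      document.getElementById("message").textContent = "Hello from inline JavaScript!";
    </script>

    <!-- Option 2: JavaScript loaded from a file in the same folder as this page.
         "my_script.js" is a hypothetical name; the file must exist for this to do anything. -->
    <script src="my_script.js"></script>
  </body>
</html>
```

Because both scripts sit at the end of the body, the paragraph they refer to already exists by the time they run, which is another practical reason for putting scripts there.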
paged.js allows us to use CSS to paginate a Web page. So let’s add some CSS code. We can add CSS code directly into our Web page using <style> tags, or link our Web page to a file containing CSS code using a <link> tag. The CSS code could again be on our own computer or somewhere else on the Internet, such as on a CDN. Here we will use <style> tags to include some CSS directly into our HTML file.
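As a sketch of the difference between the two routes, the page below uses both at once. The file name report.css is hypothetical and is assumed to sit in the same folder as the HTML file; the colour rule is only there to give the <style> tags something to hold.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Report</title>
    <!-- Option 1: CSS rules written directly in the page -->
    <style>
      p { color: darkblue; }
    </style>
    <!-- Option 2: CSS rules loaded from a separate file.
         "report.css" is a hypothetical file name, assumed to be in the same folder. -->
    <link rel="stylesheet" href="report.css">
  </head>
  <body>
    <p>Hello World!</p>
  </body>
</html>
```

One thing to check if you try the <link> route with paged.js: when a page is opened straight from your own disk rather than from a Web server, some browsers restrict scripts from reading other local files, so linked CSS may not be picked up for pagination. Keeping the CSS inside <style> tags avoids that question while experimenting.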
Now our page looks something like this:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title> Report </title>
    <style>
      @page {
        size: A4 portrait;
        margin: 50mm;
        @bottom-center {
          content: counter(page);
        }
      }
    </style>
  </head>
  <body>
    <p> Hello World!</p>
    <script src="https://unpkg.com/pagedjs/dist/paged.polyfill.js"></script>
  </body>
</html>
Note the part between <style> tags. In CSS code, anything that starts with a @ is a special instruction to the browser. Here we are giving some special instructions to our browser that we want certain things to happen when we print the Web page to PDF. We want the page size to be A4 and in portrait orientation. We want 5cm margins for the page. We also want the bottom centre part of the margin to contain a page counter (i.e. a page number).
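As an aside, the same mechanism can take you a little further. The variant below is a sketch rather than the page used in the rest of this post: it assumes that paged.js handles counter(pages) (the total number of pages) and the standard break-before property, and the class name new-page is made up for the illustration.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Report</title>
    <style>
      @page {
        size: A4 portrait;
        margin: 50mm;
        /* Margin box at the bottom centre of every page */
        @bottom-center {
          /* counter(pages) is the total page count, e.g. "Page 1 of 2" */
          content: "Page " counter(page) " of " counter(pages);
        }
      }
      /* Anything with this (hypothetical) class starts on a new page */
      .new-page {
        break-before: page;
      }
    </style>
  </head>
  <body>
    <p>Hello World!</p>
    <p class="new-page">Hello again, from the second page!</p>
    <script src="https://unpkg.com/pagedjs/dist/paged.polyfill.js"></script>
  </body>
</html>
```

This is the sense in which pagination is under your control: the CSS, not the browser, decides where the second paragraph begins. The simpler page built up above is the one carried forward from here.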
If we save our HTML file, open it in our browser and print it to PDF, it should look like this: