-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathfunction.Rmd
More file actions
279 lines (219 loc) · 11.2 KB
/
function.Rmd
File metadata and controls
279 lines (219 loc) · 11.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
---
title: "R functions"
author: "Brian S. Yandell"
date: "7/5/2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Below are some direct quotes from various sources on what functions are and why they are useful for data scientists:
[R4DS](http://r4ds.had.co.nz/functions.html): Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:
* You can give a function an evocative name that makes your code easier to understand.
* As requirements change, you only need to update code in one place, instead of many.
* You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
... You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
... Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
[JB](http://stat545.com/cm102_writing-functions.html): My goal here is to reveal the process a long-time useR employs for writing functions. I also want to illustrate why the process is the way it is. Merely looking at the finished product, e.g. source code for R packages, can be extremely deceiving. Reality is generally much uglier … but more interesting! ... Build that skateboard before you build the car or some fancy car part. A limited-but-functioning thing is very useful. It also keeps the spirits high.
[MRC](https://maryrosecook.com/blog/post/a-practical-introduction-to-functional-programming): Functional code is characterised by one thing: the absence of side effects. It doesn’t rely on data outside the current function, and it doesn’t change data that exists outside the current function.
[CMU](http://www.stat.cmu.edu/~cshalizi/402/programming/writing-functions.pdf): The ability to read, understand, modify and write simple pieces of code is an essential skill for modern data analysis. ... Someone who just knows how to run canned routines is not a data analyst but a technician who tends a machine they do not understand. Fortunately, writing code is not actually very hard, especially not in R. All it demands is the discipline to think logically, and the patience to practice.
- [User Written Function (Quick-R)](https://www.statmethods.net/management/userfunctions.html)
- [Write your own R functions by Jenny Bryan](http://stat545.com/cm102_writing-functions.html)
+ [part 1: bare bone basics](http://stat545.com/block011_write-your-own-function-01.html)
+ [part 2: generalize for multiple uses](http://stat545.com/block011_write-your-own-function-02.html)
+ [part 3: missing values (`NA`) and pass through (`...`)](http://stat545.com/block011_write-your-own-function-03.html)
+ [linear regression function in detail](http://stat545.com/block012_function-regress-lifeexp-on-year.html)
+ [split-apply-combine the function to all countries](http://stat545.com/block013_plyr-ddply.html#recall-the-function-we-wrote-to-fit-a-linear-model)
- [R4DS book](http://r4ds.had.co.nz/): [Functions](http://r4ds.had.co.nz/functions.html) & [Iteration](http://r4ds.had.co.nz/iteration.html)
- [Adv-R book](http://adv-r.had.co.nz/): [Functional Programming](http://adv-r.had.co.nz/Functional-programming.html) & [Environments](http://adv-r.had.co.nz/Environments.html)
- [How to write and debug an R function (R Bloggers)](https://www.r-bloggers.com/how-to-write-and-debug-an-r-function/)
- [Writing R Functions (CMU)](http://www.stat.cmu.edu/~cshalizi/402/programming/writing-functions.pdf)
- [Introduction to R Functions (UC Berkeley)](https://www.stat.berkeley.edu/~statcur/Workshop2/Presentations/functions.pdf)
- [DataCamp Tutorial on R functions](https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial)
- [Kickstarting R: functions](https://cran.r-project.org/doc/contrib/Lemon-kickstart/kr_rfunc.html)
- [Practical intro to functional programming using python (MRC)](https://maryrosecook.com/blog/post/a-practical-introduction-to-functional-programming)
- [indirection in programming](https://en.wikipedia.org/wiki/Indirection)
miscellaneous notes:
- functional programming (prequel to functions)
- anonymous function, function closure
- `stopifnot()` and `if() {stop()}`
- turning interactive code into a function
- `return()` values and `invisible()`
* * * * * * *
## Simple function
A simple function has a name, possibly one or more arguments, and a body.
```{r}
calcGCPct <- function(xSeq) {
length(which(xSeq %in% c("C","G"))) / length(xSeq)
}
```
```{r}
mySeq <- sample(c("A","C","G","T"), 1000, replace = TRUE)
calcGCPct(mySeq)
calcGCPct(mySeq[1:100])
```
## Arguments
- order: full names, abbreviated names, order
- assignment: `=` vs. `<-`
- elipsis (`...`) for arbitrary additional arguments
- defaults
- use arguments to avoid cut and paste
- sensible conventions for argument names
+ mean something to you
+ match use in functions you call
```{r}
makeVector <- function(aa, ab, ba) { c(aa, ab, ba) }
makeVector(1, 2, 3)
makeVector(1, 2, aa = 3)
makeVector(1, ab = 3,a = 2)
makeAlphanumericSample <- function( alphabet,
seqLength = 1000,
useProbabilities = NULL) {
if(is.null(useProbabilities))
useProbabilities <- rep(1/length(alphabet), length(alphabet))
sample(alphabet, seqLength, replace=TRUE, prob=useProbabilities)
}
makeDNASample <- function(...){
makeAlphanumericSample(c("A","C","G","T"), ...)
}
makeDNASample(seqLength = 100)
makeDNASample(seqLength = 100,
useProbabilities = c(0,.5,.5,0))
```
## Return values:
- returns last evaluated expression.
- thus only one object, but object can be a list of objects
- Hadley Wickham suggests reserving "return(expr)" for early returns.
The following returns `NULL`:
```{r}
testFunction <- function(){}
testFunction()
```
```{r}
testFunction <- function(x) 5+7
testFunction()
```
```{r}
testFunction <- function(x) { 5+7; 13 }
testFunction()
```
```{r}
testFunction <- function(x){
if(missing(x))
return(0)
x^2
}
testFunction()
testFunction(4)
```
## Scope
- R looks first in the local environment for a variable
- if not found, it looks at the environment of the calling function
- can force use of Global environment by using <--
```{r}
testFunction <- function(x){
y <- 5
x + y
}
rm(y)
```
```{r}
testFunction(2)
```
```{r eval = FALSE}
y
```
```
## Error: object 'y' not found
```
```{r}
y <- 2
testFunction(2)
y
```{r}
testFunction <- function(x) {
v1 <- x + y
y <- 10
v1 + y
}
y<-2
testFunction(4)
```
```{r}
testFunction <- function(x){
y <<- 5
x + y
}
rm(y)
testFunction(2)
y
```
## Functions and environments
A function has its own environment. Within a function, you can define a function with its own environment,
which can be real handy. Karl Broman uses this in [R/qtl2](http://kbroman.org/qtl2/assets/vignettes/user_guide.html#connecting_to_snp_and_gene_databases) to query an SQL database for genes and variants. The idea draws on [Adv-R book: Environments](http://adv-r.had.co.nz/Environments.html).
The details can be found in for instance [create_gene_query_func](https://github.com/rqtl/qtl2/blob/master/R/create_gene_query_func.R), but here is the basic idea. Set up a `create` function to create a function that calls your function. Here, we want the created function to have arguments `chr`, `start` and `end` (chromosome name, and `start` and `end` of region of interest to extract information). Your function has these arguments and more, specifically a database file names (`dbfile`).
```{r}
create_query_func <-function(dbfile) {
function(chr, start, end) {
mydbfunc(chr, start, end, dbfile)
}
}
```
Now you run this with the name of your `dbfile` to create the function that will be called.
```{r}
qf <- create_query_func("blah")
```
You can now call this as `qf("1", 34, 35)` to get information on chromosome "1" between 34 and 35, without even knowing where the database file is or what type it is. In particular, you can pass this function as an argument to another function.
If you examine this function, you will see it has a unique environment,
```{r}
qf
```
with a name, which you can identify with
```{r}
environment(qf)
```
What's more, you can list objects in this environment
```{r}
ls(environment(qf))
```
or go further and look at those objects in detail to see the value of `dbfile` (or other arguments you set up)
```{r}
ls.str(environment(qf))
```
Further, you can save this function, complete with its environment using `saveRDS`, to later retrieve it with `readRDS`.
```{r eval=FALSE}
saveRDS(qf, file = "qf.rds")
qf <- readRDS("qf.rds")
```
Here is a more complicated example. This uses the [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) package, which is a central location to discover genomic files (see [HowTo](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html)).
The creation function pulls in the [Ensembl](https://ensembl.org) genes for mouse, stores them in an object called `ensembl`, and returns a function that has the `ensembl` object embedded. The `AnnotationHub` and all its [Bioconductor](https://www.bioconductor.org/) helper packages (there are many) are needed to create the gene query, but subsequent use of `query_gene_AH` does not require anything from `Bioconductor`. The `query_gene_AH` function need only be recreated whenever the `AnnotationHub` entries change substantially. This created function can be saved using `saveRDS` as indicated above, and then transported to use in various pipelines.
```{r eval=FALSE}
create_gene_query_func_AH <- function(pattern = c("ensembl", "gtf", "mus musculus", "90"),
filename = "Mus_musculus.GRCm38.90.gtf",
chr_field = "seqnames",
start_field = "start", stop_field = "end",
filter = NULL) {
if(is.null(pattern) & is.null(filename))
stop("must provide pattern and filename")
# Visit AnnotationHub to get ensemble entries.
hub <- AnnotationHub::AnnotationHub()
hub <- AnnotationHub::query(hub, pattern)
ensembl <- hub[[names(hub)[hub$title == filename]]]
ensembl <- as.data.frame(ensembl[ensembl$type == "gene"])
colnames(ensembl)[colnames(ensembl) == "gene_id"] <- "ensembl_gene"
colnames(ensembl)[colnames(ensembl) == "gene_name"] <- "symbol"
ensembl[[start_field]] <- ensembl[[start_field]] * 10^-6
ensembl[[stop_field]] <- ensembl[[stop_field]] * 10^-6
ensembl <- ensembl[, sapply(ensembl, function(x) !all(is.na(x)))]
function(chr, start = NULL, end = NULL) {
subset_ensembl <- ensembl[[chr_field]] == chr
if(!is.null(start))
subset_ensembl <- subset_ensembl & (ensembl[[start_field]] >= start)
if(!is.null(end))
subset_ensembl <- subset_ensembl & (ensembl[[stop_field]] >= end)
ensembl[subset_ensembl,]
}
}
cat("AnnotationHub call\n", file = stderr())
query_gene_AH <- create_gene_query_func_AH()
```