JSON for S4 Objects

library(mongo)
#> Loading required package: futile.logger

The jsonlite package provides two mechanisms for converting an arbitrary object to JSON, toJSON and serializeJSON. For S4 classes neither of these messages are whole satisfactory. There is probably no existing method for toJSON, and serializeJSON produces a format which is readable by another R program, but is difficulty to parse for any other program (e.g., a Mongo (r) database).

Example Class

To illustrate the problem, I will start with an example class: an event in a log file. (This is a simplification of the P4Message class in the Proc4 package[For more information about the Proc4 package, see https://pluto.coe.fsu.edu/Proc4, https://github.com/ralmond/Proc/, and https://ralmond.r-universe.dev/Proc4.].)

setClass("Event",
         slots=list(uid="character",
                    mess="character",
                    timestamp="POSIXt",
                    processed="logical",
                    data="list"
         ),
         contains="MongoRec")

anEvent <- new("Event",uid="Tester",mess="Typed",
               timestamp=as.POSIXct("2022-05-30 16:38:58 EDT"),
               processed=FALSE,
               data=list(input="Hello, World!"))

The event class can be loaded by running mongo::load_example(). The complete source file can be found using the command system.file("examples","Event.R",package="mongo").

The event class is divided into a number of header fields, and the data, which could be anything. As such, it is perfect for storing in the Mongo database. Mongo can build indexes in the header fields, making it straightforward to build a query that asked for all of the events where Fred typed something that has not yet been processed sorted by timestamp. The data field can be anything that can be serialized in JSON.

Note that this class is also a subclass of MongoRec which is a small class with a single slot _id (accessed with the function m_id(). This is the Mongo identifier of the object. The goal of object oriented programming is that we don’t need to worry about the details in our package, as long at the authors of the MongoRec class have done their job properly.

Using toJSON to serialize an S4 class

Neither toJSON nor serializeJSON work out of the box. The toJSON function simply throws an error, as the new class is not yet registered. The function serializeJSON generates proper JSON output, but not in a format for which it will be easy to write queries about.

try(jsonlite::toJSON(anEvent))
#> Error : No method for S4 class:Event
jsonlite::serializeJSON(anEvent)
#> {"type":"S4","attributes":{"uid":{"type":"character","attributes":{},"value":["Tester"]},"mess":{"type":"character","attributes":{},"value":["Typed"]},"timestamp":{"type":"double","attributes":{"class":{"type":"character","attributes":{},"value":["POSIXct","POSIXt"]},"tzone":{"type":"character","attributes":{},"value":[""]}},"value":[1653928738]},"processed":{"type":"logical","attributes":{},"value":[false]},"data":{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["input"]}},"value":[{"type":"character","attributes":{},"value":["Hello, World!"]}]},"_id":{"type":"character","attributes":{},"value":[]}},"value":{"class":"Event","package":".GlobalEnv"}}

A simple trick produces something a lot closer to usable JSON output. First, apply the attributes() function to turn the object into a list, and then toJSON() to turn the list into JSON.

jsonlite::toJSON(attributes(anEvent),pretty=TRUE)
#> {
#>   "uid": ["Tester"],
#>   "mess": ["Typed"],
#>   "timestamp": ["2022-05-30 16:38:58"],
#>   "processed": [false],
#>   "data": {
#>     "input": ["Hello, World!"]
#>   },
#>   "_id": [],
#>   "class": ["Event"]
#> }

This is almost there. The toJSON function has turned all of the scalars into vectors. That is because, in R, there is no difference. However, it is going be harder to build the queries if the header fields are all vectors. The function jsonlite::unbox() takes care of this. The function unboxer() takes this one step further, adding the unbox flag to any scaler it finds, no matter how deeply nested.

jsonlite::prettify(
  jsonlite::toJSON(lapply(attributes(anEvent),unboxer)),
  indent=2)
#> {
#>   "uid": "Tester",
#>   "mess": "Typed",
#>   "timestamp": "2022-05-30 16:38:58",
#>   "processed": false,
#>   "data": {
#>     "input": "Hello, World!"
#>   },
#>   "_id": [
#> 
#>   ],
#>   "class": "Event"
#> }
#> 

This is much closer, but still a bit crude. In particular, it is not particularly generalizable. The as.json function and as.jlist helper function provide a more generalizable method for handling S4 classes.

The as.json and as.jlist functions.

The as.json() function implements the following algorithm for converting an S4 class to JSON is as follows:

  1. Convert the object to a list using attributes(); this list will be called a jlist in the sequel.
  2. Massage the jlist to mark elements that need special handling.
  3. Call toJSON() on the resulting list.

Steps 1 and 3 are common to many different S4 objects, it is step 2 that needs customization for each one. So as.json() calls as.jlist() to perform this step. The default method for as.json() is shown below. (The function as.json() is generic, so if necessary a new method can be introduced at this level as well.) Most of the options are simply passed to toJSON.

## as.json,ANY-method
function(x, serialize = TRUE, ...) {
  jlist <- as.jlist(x, attributes(x), serialize)
  mongolite::toJSON(jlist,...)
}

The first argument of as.jlist() is the object itself. This is used mostly for method dispatch, but sometimes it will be easier to work with the original object than the jlist. The second argument, ml, is the jlist (produced by attributes(obj)). The third object is a serialize flag that is meant to indicate whether compound objects should be serialized or not.

setMethod("as.jlist",c("Event","list"),
          function(obj, ml, serialize=TRUE) {
            ml$uid <- unboxer(ml$uid)
            ml$mess <- unboxer(ml$mess)
            ml$timestamp <- unboxer(ml$timestamp)
            ml$processed <- unboxer(ml$processed)
            ml$data <- unparseData(ml$data,
                                   serialize)
            callNextMethod(obj, ml, serialize)
          })

Several things to note about this generic function.

First, it modifies the ml argument, which is what is eventually returned. In fact, the ANY method for as.jlist just returns the ml argument, so that is the base case. Second, it finishes with a call to callNextMethod. The idea is that the MongoRec method for as.jlist knows how to take care of the _id field: callNextMethod ensures that code is called. Finally, the unparseData() function is called to deal with the arbitrary data field. This function is described in the following section.

The next example shows as.json() in action.

jsonlite::prettify(as.json(anEvent), indent=2)
#> {
#>   "uid": "Tester",
#>   "mess": "Typed",
#>   "timestamp": {
#>     "$date": 1653928738000
#>   },
#>   "processed": false,
#>   "data": "{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"input\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"Hello, World!\"]}]}",
#>   "class": "Event"
#> }
#> 

The unparseData function

There are three ways to deal with the large complex object data whose exact composition is unknown. The safest way is to call serializeJSON and turn it into a representation which can be read exactly back into R. This, however, is not inspectable and will not be easy for other programs to parse. The second way is to attempt to apply the unboxer function recursively to try to force scalar elements into scalars in the JSON. These two approaches are implemented using unparseData(data, serialize=TRUE) and unparseData(data, serialize=FALSE) respectively. Note that the serialize argument is passed from as.json to as.jlist, so this can be decided at a later point in time.

jsonlite::prettify(as.json(anEvent, serialize=FALSE),
         indent=2)
#> {
#>   "uid": "Tester",
#>   "mess": "Typed",
#>   "timestamp": {
#>     "$date": 1653928738000
#>   },
#>   "processed": false,
#>   "data": {
#>     "input": "Hello, World!"
#>   },
#>   "class": "Event"
#> }
#> 

The third method is to write custom code for unparsing (turning into something toJSON will properly serialize) and parsing the custom object. This method is superior to the others, but much more costly in terms of programmer time.

Building S4 objects from JSON

So the serialized form of the object has all of the fields needed to rebuild the object. The class field gives the name of the class, and the other field the appropriate slot values. The function fromJSON returns the serialized data as a list (a jlist), so essentially the object can be recreated by new(class,jlist).

The function parse.json function takes this approach. It converts the JSON object to a list (using fromJSON) and then passes the list to a builder function for making the object. The default builder function buildObject function. This calls the parse.jlist method appropriate for the class to clean up the list of fields, then calls new to create the object.

## buildObject
function (rec, class=rec$class) {
  jlp <- selectMethod("parse.jlist",c(class,"list"))
  if (!is.null(jlp))
    rec <- do.call(jlp,list(class,rec))
  rec$class <- NULL # Make sure it is not marked as an extra argument.
  do.call("new",c(class,rec))
}

The actual buildObject implementation is slightly more complicated to try an handle some common S3 cases.

This makes heavy use of the do.call function, which allows the program to compute the call to the function. In the last line of the function, the argument list for new is built by prepending the class to the list of slot values rec, and then new is called on the result.

S4 (and R6) objects work with new function. S3 object constructors do not have a uniform syntax. The default buildObject implementation handles the common implementation of S3 classes as a list of fields with a class attribute. However, in many cases, a custom builder is needed. This can be passed as an argument to as.json (as well as getOneRec and getManyRecs).

Just as the as.json delegate much of its work to the as.jlist, buildObject delegates its work to the parse.jlist function. Again, this is an object orient function, so that each class can process its own unique slots and then callNextMethod to deal with the inherited slots. It bottoms out at an ANY method which removes the class element so it doesn’t in the way when the object is built.)

Note that there is object of the required class available to dispatch the method for the parse.jlist generic function. So buildObject once again uses the do.call trick, this time with the selectMethod function used to find the method from the class name.

Here is the parse.jlist method for the Event object.

setMethod("parse.jlist",c("Event","list"),
          function(class, rec) {
            rec$uid <- as.character(ununboxer(rec$uid))
            rec$mess <- as.character(ununboxer(rec$mess))
            rec$timestamp <- as.POSIXct(ununboxer(rec$timestamp))
            rec$processed <- as.logical(ununboxer(rec$processed))
            rec$data <- parseData(rec$data)
            callNextMethod(class, rec)
          })

For the most part, the steps are just ensuring that all of the slots of the class have the proper value types. The parseData function is the inverse of unparseData and handles both serialized and non-serialized data.

The call to ununboxer seems a bit odd. The reason is mostly to facilitate testing. This function removes the mark that jsonlite adds to indicate that a value should be a scalar and not a vector. That mark will cause all.equal to fail, this makes it harder to text. In particular, after adding the ununboxer calls, the as.jlist and parse.jlist are inverses, and this can be used to build unit tests.

atlist <- attributes(anEvent)
jlist <- as.jlist(anEvent,atlist)
plist <- parse.jlist(anEvent,jlist)
## Need to ensure order is the same.
res <- try(testthat::expect_equal(plist[names(atlist)],atlist))
#> Error : plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).
#> 
#> `names(actual$_id)` is a character vector ('oid')
#> `names(expected$_id)` is absent
#> 
#> `actual$_id`:   NA
#> `expected$_id`:   
#> 
#> `actual$class` is an S3 object of class <scalar/character>, a character vector
#> `expected$class` is a character vector ('Event')
res
#> [1] "Error : plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).\n\n`names(actual$_id)` is a character vector ('oid')\n`names(expected$_id)` is absent\n\n`actual$_id`:   NA\n`expected$_id`:   \n\n`actual$class` is an S3 object of class <scalar/character>, a character vector\n`expected$class` is a character vector ('Event')\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <expectation_failure/expectation/error/condition>
#> plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).
#> 
#> `names(actual$_id)` is a character vector ('oid')
#> `names(expected$_id)` is absent
#> 
#> `actual$_id`:   NA
#> `expected$_id`:   
#> 
#> `actual$class` is an S3 object of class <scalar/character>, a character vector
#> `expected$class` is a character vector ('Event')
#> Backtrace:
#>     ▆
#>  1. ├─base::try(testthat::expect_equal(plist[names(atlist)], atlist))
#>  2. │ └─base::tryCatch(...)
#>  3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  6. └─testthat::expect_equal(plist[names(atlist)], atlist)

Oops. This failed. The problem is related to the _id field inherited from MongoRec. This can be seen below:

atlist$"_id"
#> character(0)
plist$"_id"
#> oid 
#>  NA

The problem is two different conventions for a missing _id attribute. The Event class needs an initializion method that takes care if this. For the purposes of illustration, this is done manually below.

m_id(anEvent) <- NA_character_
atlist <- attributes(anEvent)
jlist <- as.jlist(anEvent,atlist)
plist <- parse.jlist(anEvent,jlist)
## Need to ensure order is the same.
res <- try(testthat::expect_equal(plist[names(atlist)],atlist))
#> Error : plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).
#> 
#> `actual$class` is an S3 object of class <scalar/character>, a character vector
#> `expected$class` is a character vector ('Event')
res
#> [1] "Error : plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).\n\n`actual$class` is an S3 object of class <scalar/character>, a character vector\n`expected$class` is a character vector ('Event')\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <expectation_failure/expectation/error/condition>
#> plist[names(atlist)] (`actual`) not equal to `atlist` (`expected`).
#> 
#> `actual$class` is an S3 object of class <scalar/character>, a character vector
#> `expected$class` is a character vector ('Event')
#> Backtrace:
#>     ▆
#>  1. ├─base::try(testthat::expect_equal(plist[names(atlist)], atlist))
#>  2. │ └─base::tryCatch(...)
#>  3. │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  6. └─testthat::expect_equal(plist[names(atlist)], atlist)

A better test

A more complete test would be to create an Event object, write it out to JSON format and read it back, building the object. This test requires an equality test for the Event object, that is a method for all.equal; as this is an S3 generic, the method will be all.equal.Event.

all.equal.Event <- function (target, current,
                             ...,
                             checkTimestamp=FALSE,check_ids=TRUE) {
  if (!is(current,"Event"))
    return(paste("Target is 'Event' and current is '", 
                 class(current),"'."))
  msg <- character()
  if (check_ids)
    if ((is.na(target@"_id") && !is.na(current@"_id")) ||
        (!is.na(target@"_id") &&
         !isTRUE(all.equal(target@"_id", current@"_id"))))
      msg <- c(msg,"Database IDs do not match.")
  if (!isTRUE(all.equal(target@uid,current@uid)))
    msg <- c(msg,"User IDs do not match.")
  if (!isTRUE(all.equal(target@mess,current@mess)))
    msg <- c(msg,"Messages do not match.")
  if (!isTRUE(all.equal(target@processed,current@processed)))
    msg <- c(msg, "Processed flags do not match.")
  ## Check Data
  namet <- names(target@data)
  namec <- names(current@data)
  if (length(target@data) != length(current@data) ||
      !setequal(namet,namec)) {
    msg <- c(msg,"Names or number of data differ.")
    if (length(setdiff(namet,namec)) > 0L)
      msg <- c(msg,paste("Data in target but not in current:",
                         setdiff(namet,namec)))
    if (length(setdiff(namec,namet)) > 0L)
      msg <- c(msg,paste("Data in current but not in target:",
                         setdiff(namec,namet)))
  }
  msgd <- all.equal(target@data,current@data,...)
  if (!isTRUE(msgd)) msg <- c(msg,msgd)
  ## Timestamp
  if (checkTimestamp) {
    if (abs(target@timestamp-current@timestamp) >
        as.difftime(.1,units="secs"))
      msg <- c(msg,"Timestamps differ by more than .1 secs")
  }

  ## Return true if message list is empty.
  if (length(msg)==0L) TRUE
  else msg
}

A few notes about this function. First, all.equal returns true if the target and current entries are the same, and otherwise returns a character vector giving the differences. So, the method builds up a list of differences. Second, remember that NA == NA returns NA, not true. Using all.equal will return TRUE if both arguments are true, but not false if they are different. So isTrue(all.equal(...)) test for equality with NA==NA being marked as true.

So lets test this equality method.

anEvent1 <- anEvent
all.equal(anEvent,anEvent1)
#> [1] TRUE

anEvent1@mess <- "Shouted"
all.equal(anEvent,anEvent1)
#> [1] "Messages do not match."

This looks good, so now for the final check.

evJSON <- as.json(anEvent)
anEvent1 <- buildObject(jsonlite::fromJSON(evJSON),"Event")
all.equal(anEvent,anEvent1)
#> [1] TRUE

Saving and Restoring S4 objects to a mongo database.

A big benefit of being able to store S4 objects as JSON documents and restore them is that they now can be saved into a collection in a Mongo database. The functions saveRec, getOneRec, and getManyRecs facilitate that.

First a reference to a Mongo collection can be made using the mongo::MongoRec class or the mongolite::mongo class. MongoRec is a wrapper for mongolite::mongo which facilitates including a reference to the collection in an S4 or R6 class.

eventCol <- MongoDB("Event")

The function saveRec saves a record to a collection.

load_example() ## Loads some sample events.
anEvent <- saveRec(eventCol,anEvent)
sampleEvents <- lapply(sampleEvents, function(e) saveRec(eventCol,e))

Note that in each case the Event object is saved back into the same variable. The saveRec function returns the input object, but now has the Mongo ID (m_id field) set. The saveRec function uses this field to determine whether it will add the record as a new document in the collection (is.na(m_id(obj))) or replace the object with the Mongo ID m_id(obj).

The following example illustrates the idea:

mdbDrop(eventCol) ## This clears the collection.
## Insertion
newEvent <- Event(uid="Student1",mess="Startled",data=list(browser="chromium"))
m_id(newEvent)
mdbCount(eventCol,buildJQuery(uid="Student1")) ## Should be 0
newEvent <- saveRec(eventCol,newEvent) ## Inserts
m_id(newEvent) ## Set during the save.
mdbCount(eventCol,buildJQuery(uid="Student1")) ## Should now be 1.
savedEvent <- getOneRec(eventCol,buildJQuery(uid="Student1"))
all.equal(savedEvent, newEvent)
m_id(newEvent)  ## non-NA value indicates that it has a representation in the database.
newEvent$mess <- "Started" ## Fix misspelling
mdbCount(eventCol,buildJQuery(uid="Student1")) ## Should be 1
newEvent <- saveRec(eventCol,newEvent)  ## Replaces existing event.
mdbCount(eventCol,buildJQuery(uid="Student1")) ## Should still be 1
savedEvent <- getOneRec(eventCol,buildJQuery(uid="Student1"))
savedEvent$mess ## Internal object was modified.

Note that getOneRec (also getManyRecs) and mdbCount take a query argument. This is a JSON expression restricting the values of one or more fields. The simplest version just selects records which match on the listed field. The buildJQuery function converts from R syntax to JSON syntax.

There are two differences between getOneRec and getManyRecs. The first function returns a single object, and the second function returns a list of objects. If there are multiple objects which match the query, then getOneRec will return them all.

mdbDrop(eventCol) ## reset the collection contents
load_example() ## Loads some sample events.
sampleEvents <- lapply(sampleEvents, function(e) saveRec(eventCol,e))
mdbCount(eventCol,buildJQuery(uid="Fred"))
getOneRec(eventCol,buildJQuery(uid="Fred"))
length(getManyRecs(eventCol,buildJQuery(uid="Fred")))

mdbCount(eventCol,buildJQuery(uid="Phred")) ## Phred event is unique
getOneRec(eventCol,buildJQuery(uid="Phred"))
getManyRecs(eventCol,buildJQuery(uid="Phred")) ## Returns a list.

```

Summary: JSON Representation of S4 objects

The serializeJSON and unserializeJSON methods provide the safest way of storing an R object as a JSON string; however, the representation is not friendly for other applications. In particular, if the object is stored in an Mongo database, and indexes are built on some of the fields, a class-specific approach is needed to map the slots of the S4 class to Mongo data types.

The as.jlist and parse.jlist generic functions provide a regular mechanism for describing the transformation. The slots of the object are elements of the jlist, and the as.jlist function is responsible for doing pre-processing on the objects before toJSON is called, and parse.jlist is responsible for post-processing after fromJSON is called.

The generic function as.json can then be called to do the conversion, and the function buildObject can be called to reconstruct an object from the output of fromJSON. Note that the basic mechanism may not work for all classes. Some classes may need a special method for as.json. Although buildObject is not generic, if needed, the developer can always write a buildObject function to do the work.

Writing serialization and unserialization methods, like writing print and equality testing methods, is part of the work required when building a new class. Although this framework doesn’t do all the work, hopefully it makes the process a little bit more straightforward, providing a mechanism which can exploit the inheritance structure of the objects.