UC Libraries Statistics Schedule F Reporting Tips
Each year, every campus library prepares an annual snapshot of holdings information, which is collocated by CDL on behalf of the UC Libraries. Schedule F, in particular, is used to report digital holdings for a given year. We recommend the following for factoring in Nuxeo-based holdings:
- Number of collections: Count the number of project folders in Nuxeo (if your project folders align one-to-one with how your library models "collections"). Alternatively, count the total number of your Nuxeo-based collections that have been published in Calisphere. (If you take the latter approach, we recommend adding a caveat that you are counting only published collections.)
- Megabytes: You can obtain total file size stats in Nuxeo -- see instructions below on obtaining extent information.
- Number of items in Nuxeo: Contact us to obtain a count of the total number of items in Nuxeo. In this scenario, a simple object represents 1 item; a complex object likewise represents 1 item (the count does not include components within the complex object). Note that “item” counts are differentiated from Nuxeo “document” counts, as discussed further below.
- Number of items published to Calisphere: Extent statistics for the number of items published in Calisphere are available at https://voro.cdlib.org/calisphere.org/. Again, in this scenario, a simple object represents 1 item; a complex object likewise represents 1 item (the count does not include components within the complex object). Note that “item” counts are differentiated from Nuxeo “document” counts, as discussed further below. If you are also interested in usage statistics for your published objects, the Calisphere guide explains how to obtain them.
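The difference between "item" counts and Nuxeo "document" counts can be illustrated with a minimal Python sketch. The `NuxeoObject` class and its `component_count` field are hypothetical names made up for this illustration; they are not part of Nuxeo or of any CDL tooling.

```python
from dataclasses import dataclass


@dataclass
class NuxeoObject:
    """Hypothetical stand-in for an object in Nuxeo (illustration only)."""
    title: str
    component_count: int = 0  # 0 for a simple object; >0 for a complex object


def item_count(objects):
    # Each simple or complex object counts as exactly one item;
    # components inside a complex object are not counted.
    return len(objects)


def document_count(objects):
    # A complex object is one parent-level document plus one document
    # per component; a simple object is a single document.
    return sum(1 + obj.component_count for obj in objects)


objects = [
    NuxeoObject("Photograph"),                     # simple object
    NuxeoObject("Scrapbook", component_count=12),  # complex object
]

print(item_count(objects))      # 2
print(document_count(objects))  # 14
```

For Schedule F, the item count is the figure described in the recommendations above; the document count corresponds to the "Doc Count" column in the Nuxeo extent reports.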
Extent of content hosted in Nuxeo
In 2022, the Nuxeo extent stats were improved to provide more robust data. The sections below provide additional information on how to view/download extent reports and an explanation of the data elements.
Nuxeo extent stats (beginning in 2022)
Extent statistics are generated annually for each campus and are available in the Admin project folder within Nuxeo itself. The Admin project folder includes a directory for each campus, and the files containing extent stats data are arranged by year.
Two types of files are included in the annual reporting:
- .txt: This text document includes a list of all the individual objects throughout the campus project folders. Data elements include:
  - uid: The unique identifier for the object, automatically generated and assigned by Nuxeo.
  - path: The full path/directory for the file in Nuxeo.
- .xlsx: This spreadsheet includes summary data for each project folder within the campus-level directory. The individual data elements in the spreadsheet are:
  - Project Folder: Title of the Project Folder.
  - Doc Count: Total number of Nuxeo documents within the Project Folder. Every resource (regardless of format) created in Nuxeo is based on a Nuxeo document type and managed as a document. A simple object represents 1 document; a complex object represents 1 parent-level document, and each individual component is an additional document.
  - Unique Main File Count: Total number of files managed within the “Main Content File” section of Nuxeo objects.
  - Main File Size: Total size of the files managed within the “Main Content File” section of Nuxeo objects.
  - Unique “Files Tab” Count: Total number of files managed within the “Files” tab of Nuxeo objects.
  - “Files Tab” Size: Total size of the files managed within the “Files” tab of Nuxeo objects.
  - Unique Aux File Count: Total number of supplemental or variant versions of files, managed in the “Auxiliary Files” section of Nuxeo objects.
  - Aux File Size: Total size of supplemental or variant versions of files, managed in the “Auxiliary Files” section of Nuxeo objects.
  - Unique Derivative File Count: Total number of converted/auto-generated files in Nuxeo, created as additional service or access copies. These files appear within the “Conversions” section of Nuxeo objects.
  - Derivative File Size: Total size of converted/auto-generated files in Nuxeo, created as additional service or access copies. These files appear within the “Conversions” section of Nuxeo objects.
  - Total Unique File Count: Tally of Unique Main File Count + Unique “Files Tab” Count + Unique Aux File Count + Unique Derivative File Count.
  - Total File Size: Tally of Main File Size + “Files Tab” Size + Aux File Size + Derivative File Size.
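As a rough illustration of how the two totals relate to the other columns, here is a minimal Python sketch. The dictionary keys mirror the spreadsheet column names, but the numbers are invented for the example. The sketch also assumes that sizes are expressed in bytes and that 1 megabyte is 1,000,000 bytes; the spreadsheet may use a different unit, so check the actual report before converting.

```python
# Hypothetical values for one project folder row (made up for illustration).
row = {
    "Unique Main File Count": 120,
    "Main File Size": 4_800_000_000,
    "Unique “Files Tab” Count": 15,
    "“Files Tab” Size": 250_000_000,
    "Unique Aux File Count": 8,
    "Aux File Size": 90_000_000,
    "Unique Derivative File Count": 360,
    "Derivative File Size": 610_000_000,
}

COUNT_COLUMNS = [
    "Unique Main File Count",
    "Unique “Files Tab” Count",
    "Unique Aux File Count",
    "Unique Derivative File Count",
]

SIZE_COLUMNS = [
    "Main File Size",
    "“Files Tab” Size",
    "Aux File Size",
    "Derivative File Size",
]


def total_unique_file_count(row):
    # Sum of the four per-section file counts.
    return sum(row[column] for column in COUNT_COLUMNS)


def total_file_size(row):
    # Sum of the four per-section sizes, in the same unit as the inputs.
    return sum(row[column] for column in SIZE_COLUMNS)


def bytes_to_megabytes(num_bytes):
    # Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
    return num_bytes / 1_000_000


print(total_unique_file_count(row))              # 503
print(total_file_size(row))                      # 5750000000
print(bytes_to_megabytes(total_file_size(row)))  # 5750.0
```

A check like this can be used to confirm that the totals in a downloaded report match the sum of their parts before the figures are entered into Schedule F.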
By default, these reports are generated from the campus folder level and include summary data for each top-level project folder. Contact us if you need to report extent data at a more specific project folder level; we are able to generate stats on an ad hoc basis. These will be available in the Admin/Campus/Year folder as well.
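For a quick look at how the listed objects are distributed across project folders, the .txt listing could in principle be tallied directly. The following Python sketch is only an illustration, not the process used to generate the official reports. It assumes that each line holds a uid and a path separated by a tab, which may not match the actual file layout; the sample paths and the `count_by_project_folder` function are hypothetical.

```python
from collections import Counter


def count_by_project_folder(lines, campus_root):
    """Count listed objects under each top-level project folder of campus_root.

    Assumes each line is "<uid><TAB><path>" (an assumption about the .txt layout).
    """
    root_parts = [part for part in campus_root.split("/") if part]
    counts = Counter()
    for line in lines:
        uid, sep, path = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        parts = [part for part in path.split("/") if part]
        # Keep only paths that sit below a project folder inside the campus root.
        if parts[:len(root_parts)] != root_parts:
            continue
        if len(parts) < len(root_parts) + 2:
            continue
        counts[parts[len(root_parts)]] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample = [
    "uid-0001\t/Campus/Project A/object-1",
    "uid-0002\t/Campus/Project A/Subfolder/object-2",
    "uid-0003\t/Campus/Project B/object-3",
]

for folder, count in sorted(count_by_project_folder(sample, "/Campus").items()):
    print(folder, count)
```

Passing a deeper root (for example, a specific project folder path) groups the objects by the next folder level down. Counts obtained this way reflect only what is listed in the .txt file and may differ from the Doc Count column in the spreadsheet.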
Note that the pre-2022 Nuxeo extent stats counted files in the Nuxeo trash (i.e., stored in a "trashed state" in Nuxeo). The current stats generation process no longer counts files in the Nuxeo trash. However, we have found that the Nuxeo API (used for report generation, as Nuxeo does not have built-in reporting) does not always accurately account for objects nested deeply within many layers of project folders; hence, there may be some undercounting of objects.
We have an escalated support ticket with Nuxeo to address this problem, but recognizing the timely need for these reports in September 2022, we have worked to mitigate this issue in the current reports by programmatically and manually reviewing them, iterating on our queries until we no longer observed missing (or uncounted) objects. It is possible that some objects have not been accounted for, but we feel this is likely a small number. We apologize for the uncertainty around these reports; we will be working energetically with Nuxeo to get this issue resolved.
Nuxeo extent stats (pre-2022)
Pre-2022 extent statistics for the total file size of objects in Nuxeo are available in the Admin/Aggregate (pre-2022) project folder, within Nuxeo itself. The "deduplicated count" columns reflect the total size of truly unique files managed in Nuxeo, which includes 1) Main Content Files and auxiliary/supplemental files directly imported into Nuxeo, and 2) derivative files automatically generated by Nuxeo. Note that this count also included files in the Nuxeo trash (i.e., stored in Nuxeo in a "trashed state").
We also maintain logs for every file in Nuxeo with file size and fixity data. The logs are available in the Admin project folder within Nuxeo itself, organized into individual project folders for each campus library (e.g., Admin/UCR).
Here is an example row of data, from one of the files, with an explanation of the individual data elements:
- name=sim_182_000_011_0010.tif data=http://localhost:8080/nuxeo/nxfile/default/4afa5bfb-d589-4064-803b-76886d8e3d2d/file:content/sim_182_000_011_0010.tif
- uid: The first element is the unique identifier for the object, automatically generated and assigned by Nuxeo.
- path: The full path/directory for the file in Nuxeo.
- xpath: Nuxeo documents can be thought of as XML documents with a special "binary" node type. This is an XPath in the Nuxeo document that can be used to access the file.
- name: The source filename for the Main Content File, as ingested into Nuxeo.
- data: This URL (with modifications) can be used as the basis to download the file.
- md5: A checksum for the file. It is also used as the filename for the object within the context of Amazon S3.
- size: File size, in bytes.
- size_h: File size in 1024-based units.
- media: The MIME type of the file.
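To show how these elements might be used together, here is a minimal Python sketch that parses the example row, splits the data URL into its parts, formats a byte count in 1024-based units, and computes an MD5 checksum for comparison with the logged md5 value. Several things here are assumptions rather than documented behavior: that fields are whitespace-separated key=value pairs, that the URL layout is nxfile/repository/uid/xpath/filename (inferred from the example row), and that the unit labels (KiB, MiB, and so on) resemble what size_h contains. All function names are hypothetical.

```python
import hashlib
import re
from urllib.parse import unquote, urlparse

EXAMPLE_ROW = (
    "name=sim_182_000_011_0010.tif "
    "data=http://localhost:8080/nuxeo/nxfile/default/"
    "4afa5bfb-d589-4064-803b-76886d8e3d2d/file:content/sim_182_000_011_0010.tif"
)

# Assumption: each field is "key=value", separated from the next by whitespace.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_row(row):
    """Return the key=value fields of a log row as a dictionary."""
    return dict(FIELD_RE.findall(row.strip()))


def split_data_url(url):
    """Split a data URL into repository, uid, xpath, and filename (assumed layout)."""
    parts = urlparse(url).path.split("/")
    start = parts.index("nxfile")
    return {
        "repo": parts[start + 1],
        "uid": parts[start + 2],
        "xpath": parts[start + 3],
        "filename": unquote("/".join(parts[start + 4:])),
    }


def human_size(num_bytes):
    """Format a byte count using 1024-based units (labels are an assumption)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


fields = parse_row(EXAMPLE_ROW)
print(fields["name"])
print(split_data_url(fields["data"])["uid"])
print(human_size(1536))
```

After downloading a file, comparing `md5_of_file(...)` against the md5 value in the log is one way to confirm that the copy is intact. The localhost address in the data URL would need to be replaced with the appropriate host before any download is attempted.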