In Build 2119 we have begun developing new file formats for our software. In this post we will explain the rationale and introduce the first of the new formats, .dtmx.
Background
With the new file formats we set about achieving several goals:
1. Securely embed the provenance of the data. It must be possible for the person who obtains the data to know:
   - Where it came from.
   - What inputs were used to obtain it.
   - What settings were used, by whom, and in which software, for any automatic processing that took place.
2. Ensure the end user will always be able to decode the data. The files store information that is valuable to those who created it, and they should not be hindered from accessing that data just because they do not have access to our software. To this end, the file formats should be self-documenting.
3. Ensure that the file formats can easily evolve over time. Our existing formats have always been both forwards and backwards compatible (i.e. new versions of the software can read files produced by old versions, and old versions can read files produced by new versions, although they may omit information that was added to the file format by the newer versions), but they are not as efficient nor as easy to adapt to changing needs as they could be.
It may seem that #1 and #2 conflict with each other — after all, if the file is easy to decode, it must also be easy to encode, and therefore it would be easy to tamper with the provenance stored in the file, rendering it useless. Our solution to this is to sign the files once they have been encoded; when decoding the file, our software checks the signature to see if it has been tampered with. If it has not, then the provenance is reliable.
To achieve goals #2 and #3 we decided to borrow a page from Microsoft’s book and adopt the zip file format as our container format. (All of Microsoft’s .*x file formats, such as .docx, .xlsx, and .pptx, are actually zip files that can be unzipped in Windows Explorer just by changing the suffix to .zip, or unzipped with 7-Zip by right-clicking on the file and selecting one of the “Extract” options.)
Inside the file, the data is stored as separate files, with one mandatory master file that is always called metadata.json. All other data files within the archive must be reachable from that file. We chose JSON because it is both easily machine-readable and human-readable (satisfying #2), and its only downside (verbosity compared to a binary file format) is mitigated by compression. For data where that mitigation is insufficient, because the verbosity of JSON multiplied by the size of the data would make a significant difference to the file size, we still store the data in a binary format, but we specify exactly what that format is within the archive to ensure #2 is satisfied. Finally, because JSON is free-form, adding new attributes is easy, satisfying #3.
DTM Files
Provenance
To give an example of this in action, the first file format to be converted into an x-file is our .dtm file format, becoming .dtmx. Below is a sketch of the first part of the metadata.json for a DTMX file that was originally imported from OBJ format and then saved again as a .dtmx:
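(The field names and values shown are illustrative placeholders inferred from the description below, not the final schema.)

```json
{
  "format": "DTMX",
  "provenance": [
    {
      "action": "import",
      "source": "Convergent.obj",
      "sourceFormat": "OBJ",
      "sha256": "<SHA-256 hash of Convergent.obj>",
      "software": "<software name and version>",
      "user": "<user who loaded the OBJ file>",
      "date": "<timestamp of the import>"
    },
    {
      "action": "save",
      "software": "<software name and version>",
      "user": "<user who created the DTMX file>",
      "date": "<timestamp of the save>"
    }
  ]
}
```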
The "provenance"
array contains the history of the data. The first entry shows where the original data came from — in this case, an OBJ file
named Convergent.obj
with the given SHA-256 hash. If someone wishes to confirm that they have the same OBJ file that was originally loaded to create this DTMX file,
they can use a standard command certutil
that comes with Windows to generate the SHA-256 hash and confirm that the values are the same:
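For example, assuming Convergent.obj is in the current directory:

```
certutil -hashfile Convergent.obj SHA256
```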
For all practical purposes it is impossible for a different file to have the same hash, so this confirms that we are looking at the same file. (Note that the filename is not actually part of the check; it’s just recorded for convenience. The hash value, which is based entirely on the contents of the file, is what actually determines whether it’s the same file or not.)
All externally-sourced data will have its hash recorded. Input files generated by our software will have their signature stored instead.
The first entry also shows what version of the software was used and the user who loaded it.
The second entry in the provenance history shows who created the DTMX file and when they saved it. The next time the file is loaded, its signature will be verified and this entry will be updated with the signature as well as a note to say that it was verified:
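(Again sketched with placeholder values and assumed field names.)

```json
{
  "action": "save",
  "software": "<software name and version>",
  "user": "<user who created the DTMX file>",
  "date": "<timestamp of the save>",
  "signature": "<digital signature of the archive>",
  "verified": true
}
```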
A new entry similar to this one will be added to the provenance each time the file is saved again. As long as the final file’s signature is verified, and every entry in the provenance was recorded as verified by the software when the file was loaded, every link in the chain can be guaranteed.
Decoding
The metadata.json file itself is intended to be self-explanatory, although we still intend to document all of the fields and their default values (so they can be omitted if they still have the default value). However, DTMX files can contain many millions of points and triangles, and so these are stored in binary form for efficiency:
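(A sketch of what such a reference might look like; the item count and type name come from the example below, while the file name and field names are assumptions.)

```json
{
  "triangles": {
    "file": "triangles.bin",
    "type": "ACTriangleIndex3D",
    "count": 363574
  }
}
```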
Each time one of our new file formats needs to store data in binary form, the precise format of that data and the name of the file within the archive that contains that data will be recorded. For example, the triangles file is a packed array of 363,574 ACTriangleIndex3D items that can be read with the following C++ data type:
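(The member names and layout below are assumptions, chosen so that the struct packs to the 28 bytes per triangle mentioned later.)

```cpp
#include <cstdint>

enum class TriangleType : std::uint8_t; // defined below

// With standard 4-byte alignment, sizeof(ACTriangleIndex3D) == 28.
struct ACTriangleIndex3D
{
    std::uint32_t pointIndex[3];     // indices into the points array
    std::uint32_t neighbourIndex[3]; // indices of the three adjacent triangles
    TriangleType  type;              // 1-byte triangle classification
    // 3 bytes of padding bring the total to 28 bytes
};
```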
And TriangleType is a 1-byte enumeration:
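(The enumerator names here are illustrative; the real values would be listed in the documentation stored in the archive.)

```cpp
enum class TriangleType : std::uint8_t
{
    Normal   = 0,
    Boundary = 1,
    Hole     = 2
};
```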
We plan to store a document in the archive recording all of the binary data types it uses, with sample data structures like those above to explain how to decode them.
Flexibility
This approach will make it much easier to evolve the file formats over time, although backwards compatibility (being able to open old files) is easier to achieve than forwards compatibility (old versions of the software being able to open new files).
For the JSON components it’s easy to add extra fields and give default values to be used when reading an old file that does not contain them.
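For instance, suppose a later version adds a hypothetical "units" field whose documented default is "metres". New files can state it explicitly:

```json
{ "units": "feet" }
```

When reading an old file that lacks the field, the software simply assumes "metres".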
For the binary components we simply define an alternative data structure and record that inside the file. In the future, for example, it’s conceivable that we might need to support more than 4 billion points and/or triangles in a single file. We could define an ACTriangleIndex3D64 data type like so:
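(Again a sketch with assumed member names; widening the indices to 64 bits gives 48 bytes of indices plus the type byte, which pads out to 56 bytes.)

```cpp
// TriangleType and headers as in the earlier sketch.
// With standard 8-byte alignment, sizeof(ACTriangleIndex3D64) == 56.
struct ACTriangleIndex3D64
{
    std::uint64_t pointIndex[3];     // indices into the points array
    std::uint64_t neighbourIndex[3]; // indices of the three adjacent triangles
    TriangleType  type;              // 1-byte triangle classification
    // 7 bytes of padding bring the total to 56 bytes
};
```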
And then use it only for really large DTMs. (The reason not to use it unless required is that the struct above is 56 bytes per triangle rather than 28; those extra 28 bytes per triangle would add roughly 120 GB of overhead to a 4-billion-triangle DTM!)
Can I Use These Formats?
Yes! One of the reasons for making the file formats public and easy to replicate is so that third-party software can both read and write them.
The only problem is that if the file was generated by another package, it won’t be signed unless we provide the developer with a unique key. If it’s not signed, then our software will not set the "verified" field to true, as it has no way of knowing whether the history is authentic.
However, it can still record the hash of the file it read so at least the source file itself can be certified.
Future Work
This is still a work in progress, and we have many more file formats to convert. Our intention is to bump the software version to 3.0 once complete. Feel free to get in touch if you have any feedback!