Solving the Data Format Problem with Daffodil

It goes without saying that to be useful anywhere, data has to be in some sort of format. But every time you start using a new data format, you have to tell your software how to use it, and you have to find ways to convert data into and out of it.

That’s the data format problem, and it gets more complicated every year.

As hardware and software grow more sophisticated, data formats grow more complex and specialized. At the same time, organizations have an ever-increasing need to share data across different platforms and applications. There are dozens of tools available to address specific use cases, but until now, there’s been no generally accepted way to deal uniformly with the variety of specialized data formats in complex computing environments.

The Apache Daffodil (incubating) project has been working to solve the data format problem by standardizing and unifying the capabilities of existing tools. The project is based on the Data Format Description Language (DFDL), which was developed by the Open Grid Forum. DFDL is not a data format itself, but a standard framework for describing the attributes of any data format.

DFDL is used to create a schema that defines the elements and properties that make up a particular data format. That schema can be used to parse a file or data stream and create an “infoset,” which breaks the data into standard, identifiable elements easily accessed as XML or JSON. This infoset can be “unparsed” back into the DFDL-described format.
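
To make the parse/unparse round trip concrete, here is a minimal Python sketch. It does not use Daffodil or a real DFDL schema: the record layout (a 2-byte ID, a 1-byte length, and an ASCII name) and the functions parse and unparse are hypothetical stand-ins, hand-written to show the idea of turning raw bytes into a named, typed infoset and back again.

```python
import json
import struct

# Hypothetical toy format (an assumption for illustration, not a real DFDL schema):
#   bytes 0-1: record id, unsigned 16-bit big-endian
#   byte  2  : length of the name in bytes
#   bytes 3+ : name, ASCII text

def parse(data: bytes) -> dict:
    """Turn raw bytes into an infoset-like dict of named, typed elements."""
    if len(data) < 3:
        raise ValueError("record too short")
    record_id, name_len = struct.unpack(">HB", data[:3])
    name_bytes = data[3:]
    if len(name_bytes) != name_len:
        raise ValueError("name length does not match header")
    return {"record": {"id": record_id, "name": name_bytes.decode("ascii")}}

def unparse(infoset: dict) -> bytes:
    """Turn the infoset back into the original byte layout."""
    rec = infoset["record"]
    name_bytes = rec["name"].encode("ascii")
    return struct.pack(">HB", rec["id"], len(name_bytes)) + name_bytes

raw = struct.pack(">HB", 7, 5) + b"hello"
infoset = parse(raw)
print(json.dumps(infoset))        # the infoset, viewed as JSON
assert unparse(infoset) == raw    # the round trip reproduces the input bytes
```

With DFDL, the layout knowledge that is hard-coded in these two functions lives in the schema instead, so one processor can handle many formats.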

(For a deeper dive into how DFDL works, check out this ApacheCon presentation from Owl’s Mike Beckerle, who’s also a member of the Daffodil team.)

Dozens of DFDL schemas have already been developed for common (and uncommon) data formats, and many more are in the works. Some are openly available, others can be commercially licensed, and organizations can also create their own DFDL schemas.

DFDL is expected to become “official” as a final open specification in 2021, but it’s already a proposed standard, and has been implemented successfully at IBM and the European Space Agency, as well as the Apache Daffodil project.

Solving Security Problems with DFDL

Here at Owl, we’ve integrated Daffodil into our cross domain solutions (CDSs), providing new content filtering capabilities, and enabling users to more safely and easily integrate different data types into their cross domain data transfers. Among other applications, DFDL is an ideal tool for preventing a “bad data denial of service” attack.

Here’s how the attack is supposed to work:

  • A system or device (for example, a military asset or piece of industrial equipment) on a protected network is only allowed to receive and process data in a few whitelisted data formats. Any other data is blocked before it reaches the system.
  • An attacker creates a file containing unusable data, disguises it as one of the whitelisted data formats, and attempts to send it to the targeted system.
  • If the bad data reaches the targeted system, the system could crash, hang, or require restarting.

In this way, bad data can effectively disrupt critical operations.

And here’s what happens if the targeted system is protected by a CDS with DFDL-based filtering capabilities:

  • To reach the targeted system, the bad data must first pass from the network where it originated (for example, the internet) through the CDS. Every file that passes through the CDS is inspected to validate that it conforms to one of the whitelisted data formats.
  • Rather than simply looking at the file extension or metadata, the CDS uses the appropriate DFDL schema to parse the data and attempts to create an infoset from it. To pass the filter, the data must parse into the elements defined in the DFDL schema, each element must have the correct data type (integer, text string, etc.), and each value must be validated as sensible.
  • This parsing ensures that nothing can end up in the data that does not fit the specification of the format. Bad data will not conform to the DFDL schema and will be blocked. Data that’s validated to be in a whitelisted format is then unparsed to its original state and allowed to move on to the next filter on its way through the CDS to the destination system (assuming it passes all filters).
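
The filtering steps above can be sketched in a few lines of Python. This is an illustration only, not Owl's implementation or the Daffodil API: the comma-separated "sensor reading" format, the value ranges, and the names parse_reading, is_valid, unparse_reading, and filter_message are all hypothetical. The point is the order of operations: parse strictly, validate the values, and forward only what can be regenerated from the infoset.

```python
import re
from typing import Optional

# Hypothetical whitelisted format (an assumption for illustration):
#   "<sensorId>,<temperature>,<unit>\n", for example b"42,21,C\n"
READING = re.compile(rb"([0-9]+),(-?[0-9]+),([A-Z])\n")
ALLOWED_UNITS = {"C", "F"}

def parse_reading(data: bytes) -> dict:
    """Parse the whole message or fail; leftover bytes are not tolerated."""
    match = READING.fullmatch(data)
    if match is None:
        raise ValueError("data does not conform to the format")
    return {
        "sensorId": int(match.group(1).decode("ascii")),
        "temperature": int(match.group(2).decode("ascii")),
        "unit": match.group(3).decode("ascii"),
    }

def is_valid(infoset: dict) -> bool:
    """Check that each parsed element holds a sensible value."""
    return (
        1 <= infoset["sensorId"] <= 999
        and -50 <= infoset["temperature"] <= 150
        and infoset["unit"] in ALLOWED_UNITS
    )

def unparse_reading(infoset: dict) -> bytes:
    """Regenerate the message from the infoset alone."""
    line = f'{infoset["sensorId"]},{infoset["temperature"]},{infoset["unit"]}\n'
    return line.encode("ascii")

def filter_message(data: bytes) -> Optional[bytes]:
    """Return the bytes to forward, or None if the message is blocked."""
    try:
        infoset = parse_reading(data)
    except ValueError:
        return None                    # not in the whitelisted format
    if not is_valid(infoset):
        return None                    # well-formed, but values are not sensible
    return unparse_reading(infoset)    # only regenerated data moves on

print(filter_message(b"42,21,C\n"))       # forwarded
print(filter_message(b"42,9999,C\n"))     # blocked: temperature out of range
print(filter_message(b"\x00\xffjunk"))    # blocked: does not parse
```

Because the forwarded bytes are rebuilt from the infoset rather than copied from the input, anything the format description does not account for never reaches the next filter.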

If you’re interested in learning more about the DFDL specification, especially if you’re a developer who’d like to get involved with the project, the Apache Software Foundation’s Daffodil project page has all the info.

To learn more about how your organization can use DFDL-based filtering to protect critical assets, submit a contact form, or start a chat with a representative now.
