Solving the Data Format Problem with Daffodil

It goes without saying that to be useful anywhere, data has to be in some sort of format. But every time you start using a new data format, you have to tell your software how to use it, and you have to find ways to convert data into and out of it.

That’s the data format problem, and it gets more complicated every year.

As hardware and software grow more sophisticated, data formats grow more complex and specialized. At the same time, organizations have an increasing need to share data from an increasing variety of sources (for example, aircraft, tanks, ships, and satellites), across platforms and applications. There are dozens of tools available to address specific use cases, but until now, there’s been no generally accepted way to deal with the variety of specialized data formats in complex computing environments.

The Apache Daffodil (incubating) project has been working to solve the data format problem by standardizing and unifying the capabilities of existing tools. The project is based on the Data Format Description Language (DFDL), which was developed by the Open Grid Forum. DFDL is not a data format itself, but a standard framework for describing the attributes of any data format.

DFDL is used to create a schema that defines the elements and properties that make up a particular data format. That schema can be used to parse a file or data stream and create an “infoset,” which breaks the data into standard, identifiable elements easily accessed as XML or JSON. This infoset can be “unparsed” back into the DFDL-described format.
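
The parse/unparse round trip can be sketched in a few lines of Python. This is only a hand-written stand-in, not a DFDL schema and not Daffodil's API: the record layout, the field names, and the `parse_record` and `unparse_record` helpers are all assumptions made up for illustration.

```python
import json
import struct

# Hypothetical fixed-length binary layout (an assumption for illustration):
# big-endian float64 latitude, float64 longitude, uint16 speed in km/h.
RECORD = struct.Struct(">ddH")
FIELDS = ("latitude", "longitude", "speed_kmh")


def parse_record(data: bytes) -> dict:
    """Parse raw bytes into a dict that stands in for an infoset."""
    if len(data) != RECORD.size:
        raise ValueError(f"expected {RECORD.size} bytes, got {len(data)}")
    return dict(zip(FIELDS, RECORD.unpack(data)))


def unparse_record(infoset: dict) -> bytes:
    """Serialize the infoset back into the original binary layout."""
    return RECORD.pack(*(infoset[name] for name in FIELDS))


raw = RECORD.pack(38.8977, -77.0365, 88)
infoset = parse_record(raw)
print(json.dumps(infoset))              # the same data, viewed as JSON
assert unparse_record(infoset) == raw   # the round trip restores the bytes
```

In a real deployment, the layout would live in a DFDL schema rather than in code, so the same parser could handle many formats.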

(For a deeper dive into how DFDL works, check out this ApacheCon presentation from Owl’s Mike Beckerle, who’s also a member of the Daffodil team.)

Dozens of DFDL schemas have already been developed for common (and uncommon) data formats, and many more are in the works. Some are openly available, others can be commercially licensed, and organizations can also create their own DFDL schemas.

DFDL is expected to become “official” as a final open specification in 2021, but it’s already a proposed standard, and has been implemented successfully by IBM and the European Space Agency, as well as by the Apache Daffodil project.

Solving Security Problems with DFDL

Here at Owl, we’ve integrated Daffodil into our cross domain solutions (CDSs), providing new content filtering capabilities, and enabling users to more safely and easily integrate different data types into their cross domain data transfers. Among other applications, DFDL is an ideal tool for preventing a “bad data denial of service” attack.

Here’s how the attack is supposed to work:

  • A system or device (for example, a military asset or piece of industrial equipment) on a protected network is only allowed to receive and process data in a few whitelisted data formats (for example, sensor data such as GPS, weather radar, or vehicle position/speed). Any other data is blocked before it reaches the system.
  • An attacker creates a file containing malicious data, disguises it as one of the whitelisted data formats, and attempts to send it to the targeted system.
  • If the bad data reaches the targeted system, the system could crash, hang, or require restarting.

In this way, bad data can effectively disrupt critical operations.

And here’s what happens if the targeted system is protected by a CDS with DFDL-based filtering capabilities:

  • To reach the targeted system, the bad data must first pass from the network where it originated (for example, the internet) through the CDS. Every file that passes through the CDS is inspected to validate that it conforms to one of the whitelisted data formats.
  • Rather than simply checking the file extension or metadata, the CDS uses the appropriate DFDL schema to parse the data and attempts to create an infoset from it. To pass the filter, the data must parse into the elements defined in the DFDL schema, and each element must have the correct data type (integer, text string, etc.) and a sensible value (for example, a reasonable speed or latitude/longitude).
  • This parsing ensures that nothing can end up in the data that does not fit the specification of the format. Bad data will not conform to the DFDL schema and will be blocked. Data that’s validated to be in a whitelisted format is then unparsed to its original state and allowed to move on to the next filter on its way through the CDS to the destination system (assuming it passes all filters).
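
The filtering steps above (parse, check each value, then unparse) can be illustrated with a minimal Python sketch. It is not how Owl's CDS or Daffodil is implemented: the record layout, the `LIMITS` table, and the `filter_record` function are hypothetical, and a real filter would take its rules from a DFDL schema rather than from hard-coded values.

```python
import struct
from typing import Optional

# Hypothetical whitelisted format (an assumption for illustration):
# big-endian float64 latitude, float64 longitude, uint16 speed in km/h.
RECORD = struct.Struct(">ddH")
FIELDS = ("latitude", "longitude", "speed_kmh")

# Assumed "sensible value" ranges for each element.
LIMITS = {
    "latitude": (-90.0, 90.0),
    "longitude": (-180.0, 180.0),
    "speed_kmh": (0, 400),
}


def filter_record(data: bytes) -> Optional[bytes]:
    """Return re-serialized bytes if the data passes, or None if it is blocked."""
    if len(data) != RECORD.size:
        return None  # does not fit the format at all
    infoset = dict(zip(FIELDS, RECORD.unpack(data)))
    for name, (low, high) in LIMITS.items():
        # NaN fails both comparisons, so it is blocked as well.
        if not (low <= infoset[name] <= high):
            return None
    # Unparse from the validated infoset instead of forwarding the input bytes.
    return RECORD.pack(*(infoset[name] for name in FIELDS))


good = RECORD.pack(38.8977, -77.0365, 88)
bad = RECORD.pack(138.8977, -77.0365, 88)   # latitude out of range
print(filter_record(good) == good)          # passes and is restored unchanged
print(filter_record(bad))                   # blocked
```

Rebuilding the output from the validated infoset, rather than passing the input through, means only data that can be expressed in the format is forwarded.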

If you’re interested in learning more about the DFDL specification, especially if you’re a developer who’d like to get involved with the project, the Apache Software Foundation’s Daffodil project page has all the info.

To learn more about how your organization can use DFDL-based filtering to protect critical assets, submit a contact form, or start a chat with a representative now.

Scott Coleman Vice President of Marketing
