F.A.I.R. Principles in Data for AI

How the FAIR Data Principles apply to Machine Learning Data and Infrastructure

At Hopsworks, the FAIR Guiding Principles for scientific data management and stewardship have been a cornerstone of our approach to build a better machine learning platform. F.A.I.R. principles initially became prevalent in academia and diverse fields of research in an effort to make sure that the ever growing amount of data could still be usable and beneficial for the society, and it has since been widely adopted. However, few people mention them in the context of machine learning systems and data management. Yet those principles are even more relevant today in the fast moving AI and LLMs landscape, where new legislation is changing the rules of the game. 

AI professionals should consider how questions of ethics, data management, and open frameworks may influence their choice of tools and machine learning platforms when implementing modern ML systems. In Hopsworks, we follow the F.A.I.R. principles in the design of a platform for managing machine learning data and infrastructure. 

What are the Four Core Concepts of F.A.I.R.? 

Findable; referring to mechanics to make the data easily searchable and findable. Infrastructure, stakeholders, and projects need easy-to-use functionality for data discovery. 

  • Data needs to follow clear naming conventions, be indexed for free-text search and have persistent uniquely identified metadata that clearly and explicitly describe the data. 
  • The design and curation of metadata needs to have good system support.

Accessible; allow access not only to the data but the provenance of the data and metadata for the data. 

  • Open, free, and universally implementable protocols that allow access to the data itself, the metadata and its provenance,
  • Access control support is required when sharing data. Role-based access control is good, but attribute-based access control and/or dynamic role-based access control provides even more fine-grained support for data sharing and reuse.

Interoperable; data should be easily shared between different computer systems. This is achieved by implementing open standards and formats for data

  • Open and accessible file formats and transport protocols for accessing the data. 

Reusable; data produced by one system should be easy to reuse in downstream systems, without copying the data. In order to reuse data, it’s important to include metadata related to the data licenses,, provenance, community standards, and custom metadata that will allow other institutions, teams or groups to be able to reuse the data.

  • Versioning, cataloging, provenance/lineage, data integrity, and custom metadata make it easier for users of data to decide on whether they can use the shared data.

Why F.A.I.R. is challenging for AI platforms and ML Systems

Some of the FAIR principles are directly applicable in the context of machine learning systems: there are lots of open source frameworks, file systems, and programming languages that are used for the operation of AI products and services. Still, some very serious challenges do emerge that are specifically due to the way any ML System needs to operate. 

Findable; while strategies that apply metadata and clear nomenclature can be applied in the context of operational machine learning systems, practitioners will find it challenging to create a clear centralized logic between the different data sources and databases needed to operate such services; a modern ML system might need to be connected to multiple sources, some of which may be real-time, or vector databases for large languages. Making a clear structure for the assets and the metadata becomes a complex endeavor without a centralized solution capable of catering to the different scenarios. 

Interoperable & Accessible; When open frameworks and open file formats are used; core challenges in regards to accessibility and interoperability should be easier to resolve; in which case it becomes important to consider open standards, compute engines and avoid DSLs. One additional challenge that can span from the very nature of the underlying data is to make it accessible for auditing (for example; what was the data that the model in production last year trained on?), review and debugging whilst the systems continuously updates and appends data.

Reusability; Finally a fundamental characteristic of machine learning models is that some of them require the data processing to be directly tied to the model that will be trained; we call these model-dependent transformations. This process essentially compromises the integrity of the data and the underlying datasets can’t be re-used in a different scenario. And not only does it prevent the reuse of the data itself, it is also harder to understand for a human. This leads to significant holds on the ability of any organization to reuse their data in different models, leading to deduplication and the creation of monolithic pipelines that are notoriously harder to scale from. 

Making Data for AI F.A.I.R.

Use case ofHopsworks with the Human Exposome Assessment Platform

At Hopsworks, we have a strong heritage working with academia and research, participating in projects such as HEAP (Human Exposome Assessment Project) that manages personal data from numerous medical institutes across the world. We have always been mindful of the evident privacy and security concerns and needs of efficiency in managing data following FAIR principles; when approaching such project; we consider those principles as a blueprint on how to refine our own software; 

  • Using open frameworks, 
  • Using open languages,
  • Modular technologies, 
  • Reusable file formats.

Additionally, striving to build strong abstractions and APIs that enable users and organizations to have a better understanding of the models they are building and more flexibility in reusing their data pipelines. Those are core aspects of the Hopsworks platform, which we believe all state-of-the-art ML  platforms should follow to be within the FAIR framework.

FAIR principles in practice at Hopsworks

FAIR principles at Hopsworks

Sources;

Hopsworks logo square
Lex Avstreikh
Head of Strategy at Hopsworks | Website | + posts

Hopsworks is part of OVHcloud's ecosystem. They are a Swedish data-intensive AI Lakehouse platform; offering capabilities to manage machine learning data and models at scale.