Support for UTF-8 in HDF5 is essential for modern scientific and data-intensive applications. It ensures data integrity, promotes interoperability, and enables the handling of text in the world's many languages and scripts. Scientific data is often processed with a variety of tools and programming languages across different operating systems. Because UTF-8 is the standard for text on the web and in contemporary operating systems, consistent UTF-8 support allows HDF5 to facilitate data exchange between platforms and applications, making data sharing easier and more efficient for the scientific community.

This project aims to address high-priority UTF-8 issues in the HDF5 library, focusing on fixing bugs associated with UTF-8 filenames and variable-length strings, particularly on Windows. Additional enhancements will support UTF-8 in the HDF5 tools and enable the introspection of UTF-8 strings within functions and variable-length string properties in the library.

The work also adds UTF-8 character testing to HDF5's existing Continuous Integration (CI) framework, validating the library's ability to handle international characters in its various components, including filenames. While HDF5 has foundational support for UTF-8, this support is not consistently enforced or tested, which can lead to inconsistencies and bugs, especially when data is transferred across platforms. A new suite of tests designed explicitly for UTF-8 will account for variations in how operating systems, particularly Windows, handle UTF-8 in filenames and command-line interactions. The new tests will be integrated into the existing GitHub Actions workflows to ensure continuous validation of HDF5's internationalization capabilities. This proactive approach will help prevent character encoding issues and improve the quality of the HDF5 library for the global scientific community.
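
To make the filename testing concrete, the following is a minimal sketch of the kind of round-trip check a UTF-8 test suite might include. It is an illustration only, not the project's actual test code: it uses Python's standard library and plain files as a stand-in for HDF5 files, and the sample names and the helper `utf8_filename_failures` are hypothetical. Names are compared after NFC normalization because some filesystems store names in decomposed form.

```python
import os
import tempfile
import unicodedata

# Hypothetical sample names for illustration; not taken from the HDF5 test suite.
SAMPLE_NAMES = ["données.h5", "データ.h5", "данные.h5", "δεδομένα.h5"]


def nfc(text):
    """Normalize to NFC so names decomposed by the filesystem still compare equal."""
    return unicodedata.normalize("NFC", text)


def utf8_filename_failures(names):
    """Return the names that do not survive a create/list/reopen round trip."""
    failures = []
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            payload = name.encode("utf-8")
            try:
                path = os.path.join(tmp, name)
                # Create the file under its non-ASCII name.
                with open(path, "wb") as handle:
                    handle.write(payload)
                # List the directory and reopen the file by the same name.
                listed = {nfc(entry) for entry in os.listdir(tmp)}
                with open(path, "rb") as handle:
                    intact = handle.read() == payload
            except (OSError, UnicodeError):
                # The platform could not represent or open this name at all.
                failures.append(name)
                continue
            if nfc(name) not in listed or not intact:
                failures.append(name)
    return failures


if __name__ == "__main__":
    failed = utf8_filename_failures(SAMPLE_NAMES)
    print("all names round-tripped" if not failed else f"failed: {failed}")
```

A check of this shape could run on each operating system in a CI matrix, so that a name that fails on only one platform is reported against that platform.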
