Open data is the most crucial missing link in creating truly open-source AI. Without sufficient high-quality open data, open-source AI cannot compete with closed-source AI models. We are creating a complete open-source ecosystem for data processing.

While several large open datasets exist thanks to initiatives like our Common Corpus, there are very few open tools for data processing. These tools are the key to unlocking new data sources and to enabling open models to reach capabilities comparable to those of frontier models. Currently, most of these tools are closed and proprietary.

PleIAs’s Open Data Toolkit consists of tools that enable users to process, filter, and curate datasets for a wide variety of uses, including training LLMs. Our existing tools include OCR correction models, vision-language models for PDF parsing, models for structuring text (e.g. formatting headlines and fixing paragraph structure), data quality filtering, and toxic data detection. We are seeking support to update these tools and to bring them together in a cohesive library. This will make the models accessible to users with varying levels of technical training and easy to incorporate into a variety of pipelines.

All our tools are trained on open data, meeting the highest levels of compliance with regulations such as the EU AI Act, which makes them suitable for both research and commercial use. They are small and efficient, so they can be applied at scale and by users with limited computational resources. They are also multilingual, making them useful and accessible to a wider range of users. These tools will increase the amount of available open data, benefiting the open-source community. The Open Data Toolkit is part of PleIAs’s goal to push the boundaries of openness in AI beyond open-weight models to a fully open development pipeline.
