What Is Data Validation?
Data validation is the process of verifying the accuracy, integrity and quality of a set of data before it is used. This can apply to all forms of data, for example, specific text, addresses, dates and more.
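To make the definition concrete, here is a minimal TypeScript sketch of field-level checks for two of the data types mentioned above, dates and addresses. The rules and function names are illustrative assumptions, not taken from any particular library.

```typescript
// Minimal sketch of field-level validation: each check confirms a value
// has the expected shape before the data is used downstream.

function isValidISODate(value: string): boolean {
  // Accept only YYYY-MM-DD strings that also parse successfully.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  return !Number.isNaN(Date.parse(value));
}

function isValidEthAddress(value: string): boolean {
  // A basic shape check: 0x prefix followed by 40 hex characters.
  return /^0x[0-9a-fA-F]{40}$/.test(value);
}

console.log(isValidISODate("2021-06-15"));              // true
console.log(isValidEthAddress("0x" + "ab".repeat(20))); // true
```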
The Need for Validity in Web3
Because oracles typically lack a built-in, truly decentralized validation mechanism, there is no guarantee that the data they provide is accurate or hasn't already been manipulated. What can happen, and has happened frequently already, is that instead of targeting a protocol directly, an attacker targets the data the protocol sources from an oracle. For attackers, manipulating the data feed is often the path of least resistance.
With attacks like these showing no signs of slowing down, validation solutions are starting to appear. However, proper data validation is much easier said than done.
Validation Challenges and Inefficiencies
Every piece of data involved in executing functions within and across blockchains needs to be validated and kept in sync, which makes proper validation more complicated than it may seem.
The easiest and most common way to implement data validation is through a centralized server, where a single entity decides whether a piece of data is accurate. This enables high-speed performance by eliminating the need to reach consensus across the globe. However, centralization also leaves significant openings for errors and malicious actors.
If a validation process is centralized, no other actors are incentivized to check that the main actor's work is correct. It also means a hacker needs to compromise only one actor to gain complete control over decision-making. With decentralization, by contrast, an attacker would need to take over more than 50 percent of an entire network of nodes to gain control, which significantly reduces both hacking risk and the chance of bias or validation error.
A Decentralized Solution
The fundamental tenet of Web3 is decentralization, which distributes authority, trust and other virtues across network users and stakeholders. Since actions must propagate to every corner of the globe, full decentralization does introduce a small delay, but when it comes to validating data, decentralization matters more than lightning-fast performance.
In general, there is no one-size-fits-all way to determine whether a piece of data is valid; developers need to create custom validation methods for each data set. What's lacking is a way to manage these different runtimes and ensure that all data sets are properly sourced and validated quickly and efficiently.
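As a rough illustration of this per-data-set approach, the following TypeScript sketch assumes a shared Validator interface and a registry that dispatches bundles to the right custom validator. All names and example data sets here are hypothetical, not a specific project's API.

```typescript
// Each data set ships its own validator behind a shared interface,
// and a small registry dispatches incoming bundles to the right one.

interface Validator {
  // Returns true if the raw payload is a valid item for this data set.
  validate(payload: string): boolean;
}

const priceFeedValidator: Validator = {
  validate: (payload) => {
    const parsed = JSON.parse(payload);
    return typeof parsed.price === "number" && parsed.price > 0;
  },
};

const blockDataValidator: Validator = {
  validate: (payload) => {
    const parsed = JSON.parse(payload);
    return typeof parsed.hash === "string" && parsed.hash.startsWith("0x");
  },
};

// One registry, many runtimes: the hard part the article points at is
// keeping all of these sourced and validated in sync.
const registry = new Map<string, Validator>([
  ["price-feed", priceFeedValidator],
  ["block-data", blockDataValidator],
]);

function validateBundle(dataSet: string, payloads: string[]): boolean {
  const validator = registry.get(dataSet);
  if (!validator) throw new Error(`No validator for data set: ${dataSet}`);
  try {
    return payloads.every((p) => validator.validate(p));
  } catch {
    return false; // malformed payloads count as invalid
  }
}
```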
One decentralized approach organizes data streams into pools. In each pool, there is a group of nodes, with one randomly selected to upload the data and the rest accountable for voting on whether that data is valid. Each vote carries a weight based on how many tokens the node has staked. Once the vote is final, the responsibility for uploading the next bundle of data shifts to another randomly selected node. This rotation combats the risk of centralization: if a single node uploaded data at all times, it would present a much larger attack surface.
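The mechanism above can be sketched as a simple simulation. This version assumes uniform random uploader selection and a stake-weighted majority threshold; a real protocol would differ in its selection rules, thresholds and penalties.

```typescript
// Simplified simulation of one pool round: a randomly selected node
// uploads a bundle, the rest vote with stake-weighted votes, and the
// uploader role rotates to a new random node each round.

interface Node {
  id: string;
  stake: number; // tokens staked; determines vote weight
}

function selectUploader(nodes: Node[]): Node {
  // Uniform random selection; real protocols may weight this by stake.
  return nodes[Math.floor(Math.random() * nodes.length)];
}

function runRound(nodes: Node[], votesValid: (n: Node) => boolean): boolean {
  const uploader = selectUploader(nodes);
  const voters = nodes.filter((n) => n.id !== uploader.id);

  // Tally stake-weighted votes on the uploaded bundle.
  let yes = 0;
  let total = 0;
  for (const voter of voters) {
    total += voter.stake;
    if (votesValid(voter)) yes += voter.stake;
  }

  // Accept the bundle only if a stake-weighted majority votes valid.
  const accepted = yes > total / 2;
  console.log(`Uploader ${uploader.id}: bundle ${accepted ? "accepted" : "rejected"}`);
  return accepted;
}

// Example round: three honest voters and one that always votes no.
const pool: Node[] = [
  { id: "a", stake: 100 },
  { id: "b", stake: 50 },
  { id: "c", stake: 75 },
  { id: "d", stake: 25 },
];
runRound(pool, (n) => n.id !== "d");
```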
Web3's data infrastructure and integrity rely heavily on truly valid data to ensure a scalable and trustless future. As more projects recognize how important data validation is, especially in Web3, more factors will undoubtedly be taken into account when validating data. The best we can do is keep building and educating around the topic.
Fabian started his journey as a tech lead at a local ed-tech startup. A hackathon in 2019 sparked his fascination with Web3, and six months later he founded his first successful project, ArVerify, an on-chain KYC system that saw wide adoption in the Arweave ecosystem. Shortly after, in 2021, he co-founded KYVE, a decentralized Web3 data lake, with John Letey.