A core component of consumer privacy protection is transparency. The wealth of available text about companies’ and institutions’ privacy practices contrasts markedly with our inability to understand digital privacy at scale. Organizations post privacy policies on the web, along with terms of service agreements, cookie policies, and other related documents, and millions of these documents now exist online. Prior research has shown that consumers lack the time and knowledge to read documents about their privacy, motivating a growing research community to use natural language processing (NLP) to extract information from these documents. However, this community faces growing problems: modern NLP techniques are hungry for large volumes of text data, while efforts to build corpora of privacy policies are both siloed within the responsible teams (i.e., the corpora are built for one-off projects) and too narrow to cover the full range of privacy documents available online.
We propose to build a large-scale, longitudinal, annotated, and searchable resource of privacy policies, terms of service agreements, cookie policies, and other related documents for the privacy research community. This resource, which we name PrivaSeer, will serve three simultaneous roles: (1) a search engine for privacy documents (i.e., privacy policies plus other kinds of relevant text); (2) a source of corpora for use by the research community; and (3) an API from which privacy-enhancing technologies can draw privacy information on demand. Through our network of committed collaborators, we will build PrivaSeer resources to assist a variety of audiences, including researchers, policymakers, privacy advocates, lawyers, and developers of consumer tools. This project will help realize long-standing goals to make online privacy manageable for consumers, regulators, and others invested in the future of our information society.