
5-Star Linked Open Elections Data 

In this post, we describe what Bulgarian election data contains, why it is difficult to work with, and what we did to make it easier to process and more accessible. We have applied semantic data integration to Bulgarian election data from 2013 onwards and published it as a harmonized RDF knowledge graph at https://elections.ontotext.com/. Our aim is to facilitate its exploration and analysis and, in this way, to improve the perception of transparency and trust in the election process.

March 24, 2021 5 mins. read Nikola Tulechki

A consistent electoral process is at the foundation of every modern democracy. Public trust in this process and its fairness is among the most important ingredients of the social capital that makes democracy function effectively. For these reasons, publishing the data related to elections is obligatory for all EU member states under Directive 2003/98/EC on the re-use of public sector information. Accordingly, the Bulgarian Central Election Commission (CEC) has released a complete export of every election database since 2011.

The data shared by the CEC represents the election process at the most granular level: it is a collection of the digital versions of the polling station protocols. Each protocol contains the results of counting a single ballot box. It lists all vote counts and who the votes were for, along with data points about the polling station such as its location, the election cycle it concerns, the number of voters and the number of invalid ballots cast. In parallel, the CEC shares the names and numbers of the candidates and their parties.

The complexities of election data

While we, as citizens, are accustomed to hearing the final results of an election and (hopefully) to voting, we know little about how complex the process is at the national level. Without appropriate data publishing and exploration platforms, it is hard to comprehend the levels of aggregation that votes go through or how these levels map onto the administrative territorial divisions of the country. People who cannot relate their votes to the final results or make sense of the data may perceive a lack of transparency, which erodes trust in the fairness of the elections.

Although the data is comprehensive, it is difficult to process for several reasons:

  • Only the most granular data is available, whereas any meaningful analysis requires aggregation at levels such as the municipality or the electoral district, which has to be computed from the individual protocols.
  • The format of the data, especially the files containing the votes, is non-standard and not consistent between elections, thus requiring a separate custom transformation for every election cycle.
  • The mechanisms of identification are also suboptimal. Parties and coalitions are identified by their numbers on the ballot, which are assigned randomly at each election. No identifiers are provided for the candidates, who are represented only by their three names.

Furthermore, both the export format and the process itself change slightly from election to election, which makes chronological comparison almost impossible without substantial data wrangling and ad hoc cleaning and matching.
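To make these difficulties concrete, here is a minimal Python sketch of the kind of wrangling involved. All election names, municipality codes, party keys and vote counts are invented for illustration, and the record layout is an assumption rather than the CEC export format. The sketch resolves per-election ballot numbers to stable party keys and then sums polling-station results up to a chosen territorial level.

```python
from collections import defaultdict

# HYPOTHETICAL lookup: ballot number -> stable party key, per election.
# Ballot numbers are reassigned at each election, so the same number can
# denote different parties in different cycles.
BALLOT_TO_PARTY = {
    "election_A": {1: "party_x", 2: "party_y"},
    "election_B": {1: "party_y", 2: "party_x"},
}

# HYPOTHETICAL polling-station protocols (illustrative values only).
PROTOCOLS = [
    {"election": "election_A", "municipality": "M1", "votes": {1: 120, 2: 80}},
    {"election": "election_A", "municipality": "M1", "votes": {1: 60, 2: 140}},
    {"election": "election_B", "municipality": "M1", "votes": {1: 90, 2: 110}},
]


def aggregate(protocols, level):
    """Sum votes per (election, territorial unit, party key)."""
    totals = defaultdict(int)
    for protocol in protocols:
        lookup = BALLOT_TO_PARTY[protocol["election"]]
        for ballot_number, count in protocol["votes"].items():
            key = (protocol["election"], protocol[level], lookup[ballot_number])
            totals[key] += count
    return dict(totals)


totals = aggregate(PROTOCOLS, "municipality")
for key, count in sorted(totals.items()):
    print(key, count)
```

Without the lookup table, ballot number 1 would be treated as the same party in both elections, which is exactly the kind of error that makes naive chronological comparison unreliable.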

Easily accessible linked open elections data

For these reasons, we have applied semantic data integration and produced a coherent knowledge graph covering all Bulgarian elections from 2013 to the present day. These are the counts of the main entities in the graph:

  • 11 election cycles covered;
  • 129 political parties;
  • 113,370 candidates;
  • 137,187 sections;
  • 185,810 voting protocols;
  • 44,478,700 cast ballots;
  • 9,609,780 preferential votes;
  • 53 million statements in total.

[Diagrams illustrating how the data is structured]

The data in the knowledge graph is harmonized along the most important dimensions:

  • Parties are linked across election cycles and across jurisdictions.
  • The administrative territorial hierarchy of Bulgaria is added and linked to Wikidata. This allows arbitrary levels of aggregation and chronological comparison of results, and it also provides a valuable link to the Linked Open Data (LOD) cloud. With Wikidata identifiers for administrative territorial entities, users can retrieve geographical information (coordinates and polygons) through SPARQL federation and link to various other statistical datasets for more in-depth analysis of electoral behavior.
  • For several election cycles, the addresses of the individual polling stations are resolved with a geocoding service, and their coordinates are also available for precise mapping.
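The federation mentioned in the list above can be sketched as a query that takes Wikidata identifiers from the election graph and fetches coordinates from the public Wikidata endpoint. In this sketch the `ex:` namespace and the `ex:wikidataEntity` predicate are hypothetical placeholders, not the actual vocabulary of the knowledge graph. Only `wdt:P625` (Wikidata's coordinate location property) and the Wikidata endpoint URL are real. Python is used merely to hold and parameterize the query text.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def coordinates_query(link_predicate="ex:wikidataEntity", limit=10):
    """Return the text of a federated SPARQL query.

    link_predicate is a HYPOTHETICAL predicate connecting a territorial
    unit in the election graph to its Wikidata entity; replace it with
    the predicate actually used by the data model.
    """
    return f"""
PREFIX ex: <http://example.org/elections#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?area ?wikidataEntity ?coordinates WHERE {{
  ?area {link_predicate} ?wikidataEntity .
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?wikidataEntity wdt:P625 ?coordinates .
  }}
}}
LIMIT {limit}
""".strip()


print(coordinates_query())
```

The `SERVICE` clause is what makes the query federated: the first triple pattern is matched locally, while the pattern inside the clause is evaluated by Wikidata.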

The data is publicly available through a SPARQL endpoint at https://elections.ontotext.com/. On the back end, the data is hosted in Ontotext’s GraphDB engine. One can explore it in the GraphDB Workbench using its search, graph traversal and visualization facilities. A set of sample queries is provided to help users understand the data model and to shorten the learning curve.
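For programmatic access, a SPARQL endpoint can also be queried over HTTP following the SPARQL 1.1 Protocol. The sketch below uses only the Python standard library. The `ENDPOINT` value is a placeholder and an assumption, since the exact query URL of the public repository is not stated in this post, and the query is deliberately generic because it does not rely on any particular data model.

```python
import json
import urllib.parse
import urllib.request

# ASSUMPTION: placeholder URL. Replace it with the repository query URL
# shown in the GraphDB Workbench of the public instance.
ENDPOINT = "https://example.org/repositories/elections"

# Generic query that works on any SPARQL endpoint.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run(endpoint, query):
    """Send the query and return the result rows (performs a network call)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as response:
        return rows(json.load(response))
```

Calling `run(ENDPOINT, QUERY)` with a real endpoint URL returns one dictionary per result row, keyed by variable name.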

The road ahead

In the future, besides adding data about new election cycles as it becomes available, we plan to obtain even finer-grained geographical information for the polling stations, to deduplicate the individual candidates, and to match and upload them to the LOD cloud.

As we are not political experts, it is not our ambition to interpret the data and draw conclusions. However, by providing 5-star Linked Open Elections Data, we want this resource to become a go-to source of data about the electoral activity in Bulgaria and to ultimately become a tool that strengthens the public’s trust in the democratic process.

See for yourself!
