Chokepoint Project

About this project

The Chokepoint Project is a non-profit project that collects, analyzes, and reports on data relating to network neutrality and civil rights in the digital domain. Every aspect of the project can conceptually be described as belonging to one of three categories: Collection, Analysis, or Reporting.

About this beta

This beta currently displays information generated from a number of data sources, such as Measurement Lab, DNShonest, OONI (the Open Observatory of Network Interference), and the World Bank. Data from these sources are used to generate and display a variety of visual analyses, such as DNS correctness, Tor bridge reachability, and Internet connectivity.

Funding

This project is currently funded by ourselves and by the European Instrument for Democracy and Human Rights (EIDHR).

Collection

What to collect?

Chokepoint aims to collect information in four key areas:

1. Network metrics

Network metrics provide an indication of network performance and capability. They range from basic quantitative measurements to more advanced qualitative statistics, such as those collected by organizations like Measurement Lab or RIPE NCC.

2. Legal and jurisdictional information

These are sources that supply information about the legislative structures that apply to the network nationally, supra-nationally, and globally. An understanding of the legal framework is crucial to understanding the significance of network metrics as well as incident reports. Apart from information on legislation, Chokepoint will collect jurisdictional information to evaluate how the law is actually interpreted and applied.

3. Incident reports / Journalistic reports

This is the most diverse and generally least structured type of data. It may incorporate reports from NGOs, individuals, and companies, as well as governmental organizations. This kind of information cannot always be quantified in a meaningful way, but it is crucial for illustrating the impact of policy and/or interventions on the people who depend on the network for their economic, social, and business activities, and for their safety and well-being.

4. Reference data

These are sources that are used to enrich source data and/or to normalize intermediate analytic results. Examples include GeoIP lookups and World Bank data. Reference data helps to facilitate (historical) analyses and to structure the presentation of those analyses.

How to collect?

Right now, Chokepoint is tailored to processing structured, regularly published data. This means that at this stage of development, Chokepoint only publishes information based on data sources that are published periodically, in a structured (i.e. machine-readable) format.
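
As a rough illustration of what "structured, regularly published" means in practice, the sketch below pulls one snapshot of a periodically published, machine-readable (CSV) source and archives it with a retrieval timestamp. The URL, file names, and fields are hypothetical placeholders, not an actual Chokepoint data source or workflow.

```python
# Minimal sketch of periodic collection from a structured source.
# The URL and field names are hypothetical placeholders.
import csv
import io
import json
import time
import urllib.request

SOURCE_URL = "https://example.org/measurements/latest.csv"  # hypothetical feed

def fetch_snapshot(url: str = SOURCE_URL) -> list[dict]:
    """Download one published snapshot and parse it into records."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

def store_snapshot(records: list[dict], path: str) -> None:
    """Append the snapshot to a local archive, tagged with retrieval time."""
    snapshot = {"retrieved_at": time.time(), "records": records}
    with open(path, "a", encoding="utf-8") as archive:
        archive.write(json.dumps(snapshot) + "\n")

if __name__ == "__main__":
    store_snapshot(fetch_snapshot(), "measurements.jsonl")
```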

Incidental data that is collected as a one-time effort over a limited period of time, such as the "internet census", can provide a wealth of information, but is intrinsically limited for the purpose of monitoring developments over time. In addition, such sources frequently cannot guarantee the legality, safety, and reliability of the information. For these reasons, the integration of these types of data has a lower priority at this time.

Work is ongoing to incorporate data that is published on a regular basis but is not (yet) machine-readable. Chokepoint is working with several organizations to set up a workflow to facilitate the incorporation of these types of data. As with incidental data, the integration of these sources has a lower priority at this time.

Issues with collecting and publishing data and information

Not all data sources are created equal. Some sources publish sparingly, others in overwhelming bulk. Some are completely unstructured, others are highly structured but require extensive enrichment and analysis to become meaningful. Partly this is a technical issue, and in that respect there are a number of best practices that can be leveraged to address some of the issues.

The big issue, as we see it, is understanding the data and information, and the potential repercussions for individuals who might be exposed by publishing, or even republishing, this data. Merely stating that this is simply a question of "anonymizing" the data would be both flippant and a gross simplification. This platform cannot profess to have sufficiently expert knowledge of every source data set. It is for this reason that anonymization has to take place with the source data owner/publisher, before the data enters the system.

Analysis

Before analysis

Before analyses can take place, structured, machine-readable data and unstructured, free-form information need to be "modeled" and "enriched". Modeling is the process of structuring raw data and information to facilitate further processing. In the case of unstructured, free-form information, this also entails making the data machine-readable. The next step is to enrich the structured data by combining it with information from other sources. Enrichment might involve the calibration of timestamps, translating network addresses to geographical coordinates, or even just homogenizing existing values and descriptors to a common standard, enabling meaningful comparisons between different data sets and entities.
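
A minimal sketch of what such an enrichment step can look like, assuming a hypothetical raw record keyed by IP address and local timestamp, and a stand-in GeoIP lookup table; the field names and values are illustrative, not part of any actual source format.

```python
# Illustrative enrichment: normalize timestamps to UTC, map IP addresses to
# countries via a (hypothetical) GeoIP table, and homogenize units.
from datetime import datetime, timezone

GEOIP_TABLE = {          # hypothetical lookup; in practice a GeoIP database
    "203.0.113.7": "NL",
    "198.51.100.2": "TR",
}

def enrich(record: dict) -> dict:
    """Return a copy of the raw record with normalized, comparable fields."""
    enriched = dict(record)
    # Calibrate timestamps: parse the source's local ISO timestamp into UTC.
    ts = datetime.fromisoformat(record["timestamp"])
    enriched["timestamp_utc"] = ts.astimezone(timezone.utc).isoformat()
    # Translate the network address into a geographical entity.
    enriched["country"] = GEOIP_TABLE.get(record["client_ip"], "unknown")
    # Homogenize units: the source reports kbit/s, the common standard is Mbit/s.
    enriched["download_mbps"] = float(record["download_kbps"]) / 1000.0
    return enriched

raw = {"timestamp": "2013-05-01T12:30:00+02:00",
       "client_ip": "203.0.113.7",
       "download_kbps": "5120"}
print(enrich(raw))
```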

What to analyze?

The type of analysis that can be performed is highly dependent on the nature and quantity of the available data. A large quantity of highly structured data with well-known characteristics, such as network performance metrics, can be analyzed using domain-agnostic statistical modeling and numerical analysis. Other types of data will require a more elaborate, domain-specific framework within which automated analyses can be performed. Work on the development of these frameworks is an ongoing area of focus. Regardless of what type of data is under consideration, any analysis aims to help answer three basic questions: a) what has happened where? b) what is happening there now? c) can any statement be made as to how 'a' evolved into 'b'?
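
A domain-agnostic example of the kind of automated analysis meant here, assuming a simple per-location series of connectivity measurements (the numbers are invented for illustration): it summarizes the historical baseline (question 'a'), reports the most recent value (question 'b'), and flags large deviations as leads for the comparison between the two (question 'c').

```python
# Domain-agnostic sketch: per-location baseline vs. current value.
# The measurement series below are invented for illustration.
from statistics import mean, stdev

def analyze(series: dict[str, list[float]]) -> dict[str, dict]:
    """For each location, compare the latest value against its history."""
    results = {}
    for location, values in series.items():
        history, current = values[:-1], values[-1]
        baseline, spread = mean(history), stdev(history)
        results[location] = {
            "baseline": baseline,      # a) what has happened
            "current": current,        # b) what is happening now
            # c) how 'a' evolved into 'b', expressed as a deviation score
            "deviation": (current - baseline) / spread if spread else 0.0,
        }
    return results

throughput_mbps = {
    "NL": [48.2, 47.9, 50.1, 49.3, 48.8, 49.0],
    "TR": [12.4, 12.1, 12.6, 11.9, 12.2, 3.1],   # sudden drop worth investigating
}
for location, summary in analyze(throughput_mbps).items():
    print(location, summary)
```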

How to analyze?

In the context of the system we are building, "analysis" refers specifically to automated analysis, with the intent of facilitating socio-political analyses by third parties. Chokepoint aims to provide a tool to make sense of a large amount of disparate data and information, by collecting and contextualizing this information across different disciplines. We recognize that automated analyses rarely allow for definitive statements about causality or intent, and we wish to avoid suggesting otherwise in the presentation of our results. (This is especially important given that the visualizations we develop, e.g. the presentation of results on a map, while making the results easier to navigate and understand, also introduce inherent bias and distortion.)

Causality v. Correlation

While causality might be tricky or impossible to ascertain through automated analysis, such analysis does allow for the detection of trends and correlations within large data sets. These trends and correlations, especially when presented in (near) real time, are extremely valuable leads for research into the underlying causal relationships (if any).
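
As an illustration of correlation detection (with invented numbers), the sketch below computes the Pearson correlation between two aligned time series, for example a network metric and a count of incident reports; a strong correlation is a lead for human research, not a statement of causality.

```python
# Sketch of correlation detection between two aligned time series.
# The series are invented; a high |r| is a lead, not proof of causality.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

tor_bridge_reachability = [0.98, 0.97, 0.96, 0.61, 0.55, 0.58]
dns_tampering_reports   = [2, 3, 2, 41, 38, 35]
print(pearson(tor_bridge_reachability, dns_tampering_reports))
```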

Historic analysis and "RQF" analysis

The integration of a large variety of different data sources poses two problems. First, different sources publish different amounts of data at different intervals: data sources that provide information about network performance are generally high-volume and nearly real-time, while data sources for legal and jurisdictional information move at a comparatively much slower rate. Second, there is the problem of information overload: to be practical, the system must be able to distinguish information based on urgency and allow for the rapid integration and contextualization of urgent information (we call this "Really Quite Fast" or RQF analysis). To address these issues, the system is arranged in a layered fashion, where each data set is first analyzed in isolation, then combined with other data sets for contextual analysis, and finally linked to historical trends and data sets for historical analysis. This results in a cumulative process that allows the system to scale over time in both breadth and depth.
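
A minimal sketch of that layered arrangement, with hypothetical stage and field names: each data set is first analyzed in isolation, then combined with other sets for contextual analysis, and finally linked to the accumulated historical record; urgent ("RQF") items can be surfaced as soon as the first, cheapest layer completes.

```python
# Layered analysis sketch (hypothetical stage names): isolate -> contextualize
# -> historicize. Urgent items surface after the first, cheapest layer.
def analyze_in_isolation(dataset: list[dict]) -> dict:
    """Layer 1: per-source summary, fast enough for 'Really Quite Fast' alerts."""
    return {"source": dataset[0]["source"], "records": len(dataset)}

def contextualize(summaries: list[dict]) -> dict:
    """Layer 2: combine summaries from different sources for one region/period."""
    return {"sources": [s["source"] for s in summaries],
            "total_records": sum(s["records"] for s in summaries)}

def historicize(context: dict, history: list[dict]) -> list[dict]:
    """Layer 3: append the contextual result to the historical record."""
    return history + [context]

history: list[dict] = []
batch_a = [{"source": "network-metrics"}] * 120
batch_b = [{"source": "incident-reports"}] * 4
context = contextualize([analyze_in_isolation(batch_a),
                         analyze_in_isolation(batch_b)])
history = historicize(context, history)
print(history)
```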

Reporting

A report by any other name should be as clear.

What is a report?

To be clear, "report" is somewhat of a misnomer, as it suggests a format where someone is presenting conclusions having been preceded by information gathering and analysis, but conclusive statements will rarely come out of this platform. To report in this context simply means "to present extracts of processed data in a predefined format".

What to report?

There are a number of questions we want to answer, and these questions determine what to report:

Is a location connected to the internet?

Is that connection free from interference?

What interference exists on that connection?

What is the legal framework, if any, within which that interference exists?

What is the jurisdictional practice in that location, and does this practice adhere to the legal framework?

Which tools are available in that location to circumvent interference, should it exist?

Are there legal consequences to the use of such tools?

Which civil rights have been violated in that location?

Which human rights have been violated in that location?

How have the answers to all of the previous questions developed over time?

How to report?

The effectiveness of a report is only as high as its accessibility, which in turn depends largely on the audience for these "reports". In this context, "one size fits no-one" applies. The primary point of access, however, will be an analytic dashboard reminiscent of Google Analytics. As different people require access to information in different formats, we are building five different levels of access to this system:

1. Visualization and mapping of analytic results: a dashboard.

2. Access to analytic visualization tools.

3. Access to intermediate and enriched data.

4. Access to raw data.

5. Access to analytic processing code.

For whom to report?

The short answer is: everyone.

The slightly more specific answer is: policy makers, judges, lawyers, human rights activists, and journalists in particular, and civil society as a whole.

The long answer is the 'five estates': the Legislative, the Executive, the Judiciary, the News Media, and Everyone Else.

There are clear reasons why we feel these should be considered the "target audience". Mostly, it is based on the conviction that we have arrived at a point in time where the divide between "the Expert" and "the Generalist" has blurred to the extent that, without generalist context, expert knowledge is isolated at the risk of irrelevance. Conversely, "generalist" professions, such as those traditionally embodied by the first four estates, are becoming increasingly difficult to exercise competently without "expert" understanding. This has been true to a great extent throughout history, but has, we feel, become of existential importance in these times of "technological upheaval". It has become particularly difficult to apply an intuitive understanding of how increased technological capacity impacts the legislative safeguards that are prerequisite to the exercise of civil liberties as the practical expression of human rights.