Data preprocessing

The fixed_network data pre-processing pipeline reads raw data on the fixed broadband network in the UK, supplements it with data for areas where raw data is not available, and transforms the result into a set of shapefiles that can be interpreted by the fixed_network model.

File structure

  • data/digital_comms/raw

    Contains an archive of untouched incoming data

  • data/digital_comms/intermediate

    Contains the intermediate files needed to pre-process the data on a cluster

  • data/digital_comms/processed

    Contains the final result that can be read by the fixed_network model
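The layout above could be captured in a small helper so that scripts do not hard-code paths. This is only a sketch: the names BASE_DIR and stage_dir are hypothetical and do not come from the project's scripts.

```python
from pathlib import Path

# Assumed layout, taken from the list above. BASE_DIR is a hypothetical name.
BASE_DIR = Path("data") / "digital_comms"

# The three stages named in the file structure.
STAGES = ("raw", "intermediate", "processed")


def stage_dir(stage: str, base_dir: Path = BASE_DIR) -> Path:
    """Return the folder for one pipeline stage, rejecting unknown stage names."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return base_dir / stage
```

For example, stage_dir("intermediate") / "exchange_areas" points at the folder generated in Step 1.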

Preprocessing

Local machine option

Step 1

Generate the exchange areas. This is necessary to split the problem into ~5895 units that can be run in a distributed environment. Note that this is a memory-intensive process that should be run on a high-memory machine (~120GB of RAM required).

python scripts/network_cluster_input_files.py

If you do not have access to such a machine, you can instead copy the intermediate/exchange_areas folder from a previous job (on the cluster) into your local project.

Step 2

Run pre-processing per exchange area. Make sure to pass the exchange area as an argument to the script.

python scripts/network_preprocess_input_files.py exchange_EACAM

This generates an intermediate file per exchange area in processed/exchange_EACAM
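When several exchange areas need to be processed on a local machine, the Step 2 command can be repeated from a short driver. The following is a minimal sketch, assuming the script is run from the project root and takes exactly one exchange area per call, as shown above. The helper names build_commands and run_all are hypothetical.

```python
import subprocess
import sys

# Script path as used in Step 2.
SCRIPT = "scripts/network_preprocess_input_files.py"


def build_commands(exchange_ids, python=sys.executable):
    """Build one command line (as an argument list) per exchange area."""
    return [[python, SCRIPT, exchange_id] for exchange_id in exchange_ids]


def run_all(exchange_ids):
    """Run the pre-processing script sequentially, stopping on the first failure."""
    for command in build_commands(exchange_ids):
        subprocess.run(command, check=True)


# Example call (not executed here): run_all(["exchange_EACAM"])
```

Running the areas one after another keeps memory use predictable; for large numbers of areas the cluster option is the better fit.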

Cluster option

This single script generates intermediate/exchange_areas on the host node and then distributes the pre-processing jobs over the cluster using GNU Parallel. The exchange areas are not regenerated if they already exist; delete them if you need to regenerate them.

cd scripts

run_parallel.sh
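The "skip if already generated" behaviour can be illustrated as follows. How run_parallel.sh actually performs this check is not shown here, so the rule below (the folder exists and is not empty) is an assumption, and needs_regeneration is a hypothetical name.

```python
from pathlib import Path


def needs_regeneration(exchange_areas_dir: Path) -> bool:
    """Return True when the exchange areas folder is missing or empty.

    Assumption: a non-empty folder counts as "already generated".
    The real script may use a different criterion.
    """
    if not exchange_areas_dir.is_dir():
        return True
    # any() stops at the first entry, so large folders are not fully listed.
    return not any(exchange_areas_dir.iterdir())
```

Under this assumption, deleting the intermediate/exchange_areas folder (or emptying it) is what forces regeneration on the next run.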

Results collection

Collect the intermediate results and combine them into a single result set in the processed directory. Without arguments, the script collects all the areas that are present in the intermediate folder. With an argument, it collects data for a specific subset, for example Cambridge, Oxford, Leeds or Newcastle.

python scripts/network_preprocess_collect_results.py

python scripts/network_preprocess_collect_results.py Cambridge
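The two invocation modes correspond to a single optional positional argument. The sketch below shows one way such a command line could be parsed with argparse. It is not the actual implementation of network_preprocess_collect_results.py, and the names build_parser and describe_run are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with one optional positional argument naming a subset."""
    parser = argparse.ArgumentParser(
        description="Collect pre-processed results (illustrative sketch)."
    )
    # nargs="?" makes the argument optional; it is None when omitted.
    parser.add_argument("subset", nargs="?", default=None,
                        help="name of a subset, for example Cambridge")
    return parser


def describe_run(argv) -> str:
    """Report which collection mode a given argument list selects."""
    args = build_parser().parse_args(argv)
    if args.subset is None:
        return "collect all areas found in the intermediate folder"
    return f"collect the {args.subset} subset"
```

Here describe_run([]) selects the collect-everything mode, while describe_run(["Cambridge"]) selects a single subset. How a subset name maps to exchange areas is left out, since that mapping is not described above.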