This guide covers the setup and maintenance of datasets for self-hosted CheckTick instances.

Overview

CheckTick provides three types of datasets for dropdown questions:

  1. NHS Data Dictionary - Standardized medical codes (scraped from NHS DD website)
  2. RCPCH NHS Organisations - Organizational data (synced from RCPCH API)
  3. User-Created - Custom lists created by organizations

All datasets are stored in the database for fast access and offline capability.

Initial Setup

Run these commands once when first setting up your CheckTick instance.

1. Sync NHS Data Dictionary Datasets

Create NHS DD dataset records and scrape initial data in one command:

# Create datasets and scrape data from NHS DD website (takes 1-2 minutes)
docker compose exec web python manage.py sync_nhs_dd_datasets

This creates 48 NHS DD datasets including:

  • Main Specialty Code (75 options)
  • Treatment Function Code (73 options)
  • Ethnic Category (17 options)
  • Smoking Status Code (6 options)
  • Clinical Frailty Scale (9 options)
  • Plus 40+ additional standardized lists

See the NHS DD Dataset Reference for the complete list.

2. Sync External API Datasets

Fetch organizational data from RCPCH API (creates datasets on first run):

# Fetch data from RCPCH API (takes 2-3 minutes, creates datasets automatically)
docker compose exec web python manage.py sync_external_datasets

This creates and populates 7 datasets:

  • Hospitals (England & Wales) - ~500 hospitals
  • NHS Trusts - ~240 trusts
  • Welsh Local Health Boards - 7 boards
  • London Boroughs - 33 boroughs
  • NHS England Regions - 7 regions
  • Paediatric Diabetes Units - ~175 units
  • Integrated Care Boards - 42 ICBs

Scheduled Synchronization

CheckTick uses two automated cron jobs to keep datasets up-to-date:

  1. NHS Data Dictionary Scraping - Scrapes NHS DD website for standardized codes
  2. External API Sync - Syncs organizational data from RCPCH API

Both commands automatically create dataset records on first run, then update them on subsequent runs. No separate seeding commands are needed.

NHS Data Dictionary Sync

Recommended schedule: Weekly (Sundays at 5 AM UTC)

What it does:

  • Reads dataset list from docs/nhs-data-dictionary-datasets.md
  • Creates any new dataset records (if added to markdown)
  • Scrapes NHS DD website for each dataset
  • Updates options with latest codes and descriptions
Crontab entry:

0 5 * * 0 cd /app && python manage.py sync_nhs_dd_datasets

Northflank setup:

  1. Create a new Cron Job service
  2. Configure:
     • Name: checktick-nhs-dd-sync
     • Schedule: 0 5 * * 0 (weekly)
     • Command: python manage.py sync_nhs_dd_datasets
  3. Copy environment variables from the web service
  4. Deploy

See Self-hosting Scheduled Tasks for full setup details.

External API Sync

Recommended schedule: Daily (4 AM UTC)

What it does:

  • Fetches latest organizational data from RCPCH API
  • Updates hospitals, trusts, health boards, etc.
  • Increments version numbers for change tracking
Crontab entry:

0 4 * * * cd /app && python manage.py sync_external_datasets

Northflank setup:

  1. Create a new Cron Job service
  2. Configure:
     • Name: checktick-dataset-sync
     • Schedule: 0 4 * * * (daily)
     • Command: python manage.py sync_external_datasets
  3. Copy environment variables from the web service
  4. Deploy

See Self-hosting Scheduled Tasks for full setup details.

Management Commands

sync_nhs_dd_datasets

Combined seed + scrape command - Reads dataset definitions from docs/nhs-data-dictionary-datasets.md, creates/updates records, and scrapes data from NHS DD website.

# Sync all datasets (create records + scrape data)
python manage.py sync_nhs_dd_datasets

# Sync a specific dataset
python manage.py sync_nhs_dd_datasets --dataset smoking_status_code

# Force re-scrape all datasets
python manage.py sync_nhs_dd_datasets --force

# Preview what would be synced (dry-run)
python manage.py sync_nhs_dd_datasets --dry-run

Options:

  • --dataset KEY - Sync only a specific dataset
  • --force - Re-scrape even if recently updated (default: skips if scraped within 7 days)
  • --dry-run - Preview changes without saving

What it does:

  1. Reads docs/nhs-data-dictionary-datasets.md and creates/updates dataset records
  2. Fetches HTML from NHS DD website for each dataset
  3. Parses tables/lists to extract codes and descriptions
  4. Updates dataset options in database
  5. Records last_scraped timestamp
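
The parsing step (2-4) can be sketched with only the standard library. This is a simplified illustration, not the command's actual implementation; the two-column table layout it assumes is an assumption about typical NHS DD pages, and the real command may use different parsing strategies:

```python
from html.parser import HTMLParser

class NHSDDTableParser(HTMLParser):
    """Collect (code, description) pairs from a two-column HTML table.

    A simplified sketch; the real command's parsing strategies may differ.
    """

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.row = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []          # start a fresh row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and len(self.row) >= 2:
            # First cell is the code, second is the description.
            self.pairs.append((self.row[0], self.row[1]))

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.row.append(data.strip())

def parse_options(html):
    parser = NHSDDTableParser()
    parser.feed(html)
    return dict(parser.pairs)
```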

Example output:

📊 Found 48 dataset(s) to process

  Fetching: https://www.datadictionary.nhs.uk/data_elements/smoking_status_code.html
  Found 6 items
✓ Scraped: Smoking Status Code

  Fetching: https://www.datadictionary.nhs.uk/data_elements/ethnic_category.html
  Found 17 items
↻ Updated: Ethnic Category

============================================================
✓ Successfully scraped: 42
↻ Successfully updated: 6
============================================================

When to use:

  • Initial setup (creates datasets from markdown and scrapes them)
  • Scheduled weekly sync
  • After NHS DD publishes updates
  • Manual refresh of specific dataset

sync_external_datasets

Syncs external datasets from the RCPCH API, automatically creating dataset records if they don't exist.

# Sync all external datasets
python manage.py sync_external_datasets

# Sync a specific dataset
python manage.py sync_external_datasets --dataset hospitals_england_wales

# Force sync even if recently synced
python manage.py sync_external_datasets --force

# Preview changes without saving
python manage.py sync_external_datasets --dry-run

Options:

  • --dataset KEY - Sync only a specific dataset
  • --force - Bypass sync frequency check
  • --dry-run - Preview without saving

What it does:

  1. Creates dataset records if they don't exist (first run)
  2. Fetches data from RCPCH API
  3. Transforms into CheckTick format
  4. Updates dataset options in database
  5. Records last_synced_at timestamp and increments version
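
The transform step (3) boils down to mapping API records into the dataset's key-value options. A hedged sketch; the field names ods_code and name are assumptions for illustration, not the RCPCH API's documented shape:

```python
def to_options(api_records):
    """Map API records to the options mapping stored on the dataset.

    'ods_code' and 'name' are assumed field names for illustration.
    Records missing either field are skipped rather than stored partially.
    """
    return {
        rec["ods_code"]: rec["name"]
        for rec in api_records
        if rec.get("ods_code") and rec.get("name")
    }
```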

When to use:

  • Initial setup (creates and populates datasets)
  • Scheduled daily sync
  • Manual refresh when RCPCH API updates

Example output:

Syncing 7 external datasets...

✓ Synced: Hospitals (England & Wales) - 487 options (version 2)
✓ Synced: NHS Trusts - 238 options (version 2)
✓ Synced: Welsh Local Health Boards - 7 options (version 2)
⊘ Skipped: London Boroughs (synced 2 hours ago, next sync in 22 hours)
...

Summary:
✓ Synced: 5
⊘ Skipped: 2
✗ Errors: 0

Configuration

Environment Variables

RCPCH API Configuration

# Optional: Override RCPCH API URL
EXTERNAL_DATASET_API_URL=https://api.rcpch.ac.uk/nhs-organisations/v1

# Optional: Add API key if required in future
EXTERNAL_DATASET_API_KEY=your_api_key_here

Defaults:

  • EXTERNAL_DATASET_API_URL: https://api.rcpch.ac.uk/nhs-organisations/v1
  • EXTERNAL_DATASET_API_KEY: Not required (public API)

Sync Frequency

Configure in dataset model (via Django admin or database):

# sync_frequency_hours field (default: 24)
dataset.sync_frequency_hours = 24  # Daily sync
dataset.save()
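
The skip/sync decision driven by this field can be sketched as a pure function (a simplification; the model's actual check may differ):

```python
from datetime import datetime, timedelta, timezone

def sync_is_due(last_synced_at, sync_frequency_hours=24, now=None):
    """Return True when a dataset has never been synced or its window elapsed.

    A sketch of the frequency check; the model's actual logic may differ.
    """
    if last_synced_at is None:
        return True  # never synced: always due
    now = now or datetime.now(timezone.utc)
    return now - last_synced_at >= timedelta(hours=sync_frequency_hours)
```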

Database Schema

DataSet Model Fields

Key fields for dataset management:

# Identity
key = CharField(max_length=255, unique=True)
name = CharField(max_length=255)
description = TextField(blank=True)
category = CharField(choices=[...])  # nhs_dd, rcpch, external_api, user_created

# Source tracking
source_type = CharField(choices=[...])  # manual, api, imported, scrape
reference_url = URLField(blank=True)  # Source URL for NHS DD datasets
api_endpoint = CharField(blank=True)  # API endpoint for external datasets

# Options storage
options = JSONField(default=dict)  # Key-value pairs

# Sync metadata
last_synced_at = DateTimeField(null=True)  # For API datasets
last_scraped = DateTimeField(null=True)  # For NHS DD datasets
sync_frequency_hours = IntegerField(default=24)
version = IntegerField(default=1)

# Sharing
is_custom = BooleanField(default=False)
is_global = BooleanField(default=False)
parent = ForeignKey('self', null=True)  # For custom versions
organization = ForeignKey(Organization, null=True)

# Discovery
tags = JSONField(default=list)
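
For illustration, the options JSONField holds a plain key-to-label mapping that dropdown choices can be built from directly. The example values below are made up, not taken from a real dataset:

```python
# Hypothetical contents of the `options` JSONField (code -> display label).
options = {
    "1": "Current smoker",
    "2": "Ex-smoker",
    "4": "Never smoked",
}

# Dropdown choices can be derived straight from the stored mapping.
choices = sorted(options.items())
```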

Troubleshooting

NHS DD Scraping Issues

Problem: Scraper can't find options on NHS DD page

✗ Error scraping Smoking Status Code: No valid options found on the page

Solutions:

  1. Check whether the NHS DD page structure has changed:

curl https://www.datadictionary.nhs.uk/data_elements/smoking_status_code.html

  2. Update the scraper parsing strategies in sync_nhs_dd_datasets.py

  3. Report the issue to the development team

Problem: HTTP errors when fetching NHS DD pages

✗ Error scraping: HTTPError 503 Service Unavailable

Solutions:

  1. Wait and retry (NHS DD might be temporarily down)
  2. Check NHS DD website status
  3. Run with --force to retry specific datasets

External API Sync Issues

Problem: RCPCH API connection errors

✗ Error syncing: ConnectionError

Solutions:

  1. Check RCPCH API status: https://api.rcpch.ac.uk/
  2. Verify EXTERNAL_DATASET_API_URL environment variable
  3. Check firewall/proxy settings
  4. Retry with --force

Problem: API rate limiting

✗ Error syncing: 429 Too Many Requests

Solutions:

  1. Reduce sync frequency
  2. Stagger sync commands (don't run all at once)
  3. Contact RCPCH for rate limit increase

Performance

Problem: Syncing takes too long

Solutions:

  1. Sync specific datasets instead of all:

python manage.py sync_external_datasets --dataset hospitals_england_wales

  2. Increase the worker timeout for cron jobs

  3. Run syncs during low-traffic periods

Monitoring

Check Dataset Status

Via Django admin:

  1. Navigate to /admin/surveys/dataset/
  2. Filter by category or source_type
  3. Check last_synced_at / last_scraped timestamps
  4. Review version numbers for update history

Via API:

# Get all datasets with sync status
curl https://checktick.example.com/api/datasets-v2/ | jq '.results[] | {key, last_synced_at, last_scraped}'
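
The same payload can be post-processed to flag stale datasets. A minimal sketch, assuming the result shape shown above and ISO 8601 timestamps with explicit offsets:

```python
from datetime import datetime, timedelta, timezone

def stale_datasets(results, max_age_days=8, now=None):
    """List dataset keys whose newest sync/scrape timestamp is missing
    or older than max_age_days. Assumes the payload shape shown above."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for ds in results:
        stamps = [ds.get("last_synced_at"), ds.get("last_scraped")]
        parsed = [datetime.fromisoformat(s) for s in stamps if s]
        if not parsed or max(parsed) < cutoff:
            stale.append(ds["key"])
    return stale
```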

Audit Logs

Dataset updates are logged in the audit log:

from checktick_app.surveys.models import AuditLog

# Check recent dataset updates
AuditLog.objects.filter(
    action__in=['dataset_synced', 'dataset_scraped']
).order_by('-timestamp')

Developer Guide: Adding New NHS DD Datasets

Process Overview

To add a new NHS Data Dictionary dataset, you only need to add an entry to the markdown table in nhs-data-dictionary-datasets.md. The automated scraping process handles everything else.

Step-by-Step Process

  1. Locate the NHS DD page for the dataset you want to add:
     • Visit the NHS Data Dictionary
     • Find the specific data element or supporting information page
     • Copy the full URL

  2. Add an entry to the markdown table:
     • Open docs/nhs-data-dictionary-datasets.md
     • Add a new row to the table under "Available NHS DD Datasets"
     • Format: | Dataset Name | NHS DD URL | Categories | Date Added | Last Scraped | NHS DD Published |

     Example entry:

| Patient Discharge Method | [Link](https://www.datadictionary.nhs.uk/data_elements/patient_discharge_method_code.html) | administrative, clinic | 2025-11-16 | Pending | - |

  3. Choose categories/tags:
     • Use existing tags for consistency: medical, administrative, demographic, clinic, paediatric, etc.
     • Separate multiple tags with commas
     • Keep tags lowercase for consistency

  4. Commit your changes:

git add docs/nhs-data-dictionary-datasets.md
git commit -m "Add [Dataset Name] to NHS DD datasets"
git push

What Happens Next

The automated sync process will:

  1. Detect the new entry in the markdown file
  2. Create a database record for the dataset
  3. Scrape the NHS DD page to extract options
  4. Populate the dataset with codes and descriptions
  5. Make it available to all users immediately

This happens during the next scheduled cron job run (see Scheduled Tasks).

Manual Trigger (Optional)

To immediately sync the new dataset without waiting for the cron job:

# Sync the new dataset (creates record + scrapes data)
docker compose exec web python manage.py sync_nhs_dd_datasets

Scraping Requirements

For successful scraping, the NHS DD page must:

  • ✅ Be a standard data element or supporting information page
  • ✅ Contain a table with codes and descriptions
  • ✅ Use consistent NHS DD table structure
  • ⚠️ Pages with non-standard formats may require custom scraping logic

If scraping fails, check the logs:

docker compose logs web | grep "scrape_nhs_dd"

Testing Your Addition

After scraping:

  1. Via Web UI:
     • Navigate to the Datasets page
     • Filter by the nhs_dd source type
     • Verify your new dataset appears
     • Check that options are populated correctly

  2. Via Django Admin (/admin/surveys/dataset/):
     • Find your dataset
     • Verify the options field has data
     • Check the last_scraped timestamp

  3. Via API:

curl https://checktick.example.com/api/datasets/?category=nhs_dd

Common Issues

Problem: Dataset created but options are empty

Solution: The scraping logic may need updating for this page's specific HTML structure. Check checktick_app/surveys/management/commands/sync_nhs_dd_datasets.py and add custom handling if needed.

Problem: Duplicate dataset entries

Solution: The sync command is idempotent; it won't create duplicates if a dataset with the same key already exists.

Problem: Dataset not appearing in UI

Solution:

  • Verify is_active=True in database
  • Check that category is set to nhs_dd
  • Ensure is_global=True

Contributing Back

After successfully adding and testing a new dataset:

  1. Update this documentation if you encountered any edge cases
  2. Submit a PR with your changes
  3. Share in GitHub Discussions to let the community know about the new dataset