Abstract
INTRODUCTION: Annual psychiatric admission records compiled by Statistics Norway (Norges Offisielle Statistikk) between 1872 and 1929 represent a rare, systematically collected archive of Norwegian mental health history. This project builds directly on a recently published data descriptor (Hegvik et al., 2025) that introduced a structured dataset of historical mental health incidence in Norway using optical character recognition and manual curation. The present pipeline was developed independently and in parallel, extending that work through an AI-assisted extraction approach and an expanded institution and diagnostic taxonomy.
AIM: To develop a reproducible, multi-stage data pipeline that transforms scanned historical psychiatric admission tables into a validated, tidy dataset and to characterize the resulting data in terms of institutional coverage, diagnostic taxonomy, temporal scope, and sex distribution.
METHODS: A three-stage pipeline was developed. First, a large language model (Gemini 2.0 Flash) extracted the structural skeleton of each annual table from scanned PDFs using a standardized prompt. Second, all numeric values and footnote markers were entered manually via a custom R Shiny validation application. Third, a final R script standardized naming conventions, removed aggregate rows, and exported a single long-format CSV.
RESULTS: The pipeline produced a dataset of 58,302 admission records spanning 58 years (1872–1929), 30 institutions, and 51 diagnostic labels. Dementia (n=17,919), Melancholia (n=10,674), and Mania (n=6,858) were the most frequently recorded diagnoses. Male admissions accounted for 51.9% of the overall total. Annual admissions grew from 550 in 1872 to 1,940 in 1929.
DISCUSSION: This dataset constitutes a complementary structured compilation of Norwegian psychiatric admission statistics for this era and demonstrates that AI-assisted extraction combined with systematic human validation can recover high-quality data from degraded historical sources. The dataset and pipeline code are openly available at https://github.com/jakebharmon/norge-historical-data.
