Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports.
- Elsevier BV
- Publication Type:
- Journal Article
- International journal of medical informatics, 2020, 145, pp. 104324
- Issue Date:
This item is closed access and not available.
BACKGROUND: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.
OBJECTIVE: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power-law distribution.
METHOD: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
RESULT: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a superior fit over a pure power-law.
CONCLUSION: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
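The first steps of the method (tokenising, counting token frequency, and fitting a power-law exponent) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the regex tokeniser, the `xmin` default, and the toy reports are all assumptions made for illustration, and the exponent uses the approximate discrete maximum-likelihood estimator commonly attributed to Clauset, Shalizi and Newman. Comparing against the lognormal, exponential, stretched exponential, and truncated power-law alternatives would additionally need likelihood-ratio tests, which this sketch omits.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (an assumed, simplistic tokeniser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_frequencies(reports):
    """Count how often each token occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def estimate_alpha(frequencies, xmin=1):
    """Approximate discrete power-law MLE:
    alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over all x_i >= xmin.
    """
    tail = [x for x in frequencies if x >= xmin]
    if not tail:
        raise ValueError("no frequencies at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical toy input; real discharge reports (e.g. from MIMIC-III) are far longer.
reports = [
    "Patient admitted with chest pain. Patient discharged home.",
    "Patient stable. Chest pain resolved.",
]
counts = token_frequencies(reports)
alpha = estimate_alpha(counts.values())

# Rank-frequency view: most common tokens first.
for rank, (token, freq) in enumerate(counts.most_common(3), start=1):
    print(rank, token, freq)
print("estimated alpha:", round(alpha, 3))
```

On a corpus this small the estimate is meaningless; the sketch only shows the shape of the pipeline. In practice `xmin` would be chosen from the data rather than fixed at 1.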