Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers who want to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language, with a development environment (IDE) called RStudio, for computing statistics and producing graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible, and it allows users to define their own (customized) functions specifying how the program should behave while handling the data; these functions can also be stored in R's simple object system. For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for using each in R. It gives a hands-on, step-by-step practical guide to identifying and conducting the different parametric and non-parametric procedures, including a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and guidance on how to interpret their results. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained throughout. It is the first book to provide a comprehensive description and a step-by-step, hands-on practical guide to carrying out the different types of statistical analysis in R for research purposes, with examples ranging from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating those datasets or objects, factorization, and vectorization, to reasoning about, interpreting, and storing the results for future use, and producing graphical visualizations and representations. In short, it brings statistics and computer programming together for research.
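As a flavor of the style the book teaches, here is a minimal R sketch (not taken from the book; the function name and data are illustrative) of a user-defined, vectorized function stored as an ordinary object:

# Illustrative only: a custom function applied to a whole vector at once
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # z-scores, vectorized
}
scores <- c(12, 15, 9, 21, 18)  # example data
z <- standardize(scores)        # no explicit loop needed
print(round(z, 2))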
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
This statistical release makes available the most recent Mental Health and Learning Disabilities Dataset (MHLDDS) final monthly data (September 2015). This publication presents a wide range of information about care delivered to users of NHS-funded secondary mental health and learning disability services in England. The scope of the Mental Health Minimum Dataset (MHMDS) was extended to cover learning disability services from September 2014. Many people who have a learning disability use mental health services, and people in learning disability services may have a mental health problem. This means that activity included in the new MHLDDS dataset cannot be cleanly divided into mental health or learning disability spells of care: a single spell of care may include inputs from either or both types of service. The Currencies and Payment file that forms part of this release is specifically limited to services in scope for currencies and payment in mental health services and remains unchanged.

This information will be of particular interest to organisations involved in delivering secondary mental health and learning disability care to adults and older people, as it presents timely information to support discussions between providers and commissioners of services. The MHLDS Monthly Report also includes reporting by local authority for the first time. For patients, researchers, agencies, and the wider public, it aims to provide up-to-date information about the numbers of people using services, spending time in hospital, and subject to the Mental Health Act (MHA). Some of these measures are currently experimental analyses. The Currency and Payment (CaP) measures can be found in a separate machine-readable data file and may also be accessed via an online interactive visualisation tool that supports benchmarking. This can be accessed through the related links at the bottom of the page.

This release also includes a note about the new experimental data file and the issuing of the ISN for the Mental Health Services Dataset (MHSDS). During summer 2015 we undertook a consultation on Adult Mental Health Statistics, seeking users' views on the existing reports and what might usefully be added to our reports when the new version of the dataset (MHSDS) is implemented in 2016. A report on this consultation can be found below.

Please note: the Monthly MHLDS Report published in February will cover November final data and December provisional data and will be the last publication from MHLDDS. Data for January 2016 will be published under the new name of Mental Health Services Monthly Statistics, with a first release of provisional data planned for March 2016. A Methodological Change paper describing changes to these monthly reports will be issued in the New Year.
A file containing all Min/Max Baseline Reports for 2005-2023 in their original format is available in the Attachments section below. A second file includes a separate set of reports, made available from 2002-2017, that did not include OLDMEDLINE records. Annual statistical reports based upon the data elements in the baseline versions of MEDLINE®/PubMed are available. For each year covered, the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum occurrences of each element in a record; minimum/average/maximum length of a single element occurrence; average record size; and other statistical data describing the content and size of the elements.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public-health decision-making on policies aimed at controlling the COVID-19 pandemic depends on complex epidemiological models that must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with corrections for systematic measurement errors, and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the standard attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO), and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text-mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, and longitude, plus additional attributes such as population.

The improved dataset benefits from major corrections to the referenced datasets and official reports, such as adjustments to the reporting dates, which suffered from a one- to two-day lag; removal of negative values; detection of unreasonable changes to historical data in new reports; and corrections of systematic measurement errors, which have been increasing as the pandemic spreads and more countries contribute data to the official repositories. Additionally, the root mean square error (RMSE) of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China are presented separately and in more detail, extracted from the reports available on the main page of the Chinese CDC website.

This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, or the pandemic's turning point, as well as in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-opening schools, alleviating business and social distancing restrictions, designing economic programs, or allowing sports events to resume.
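As a sketch of the paired-comparison metric, the RMSE for one attribute reported by two sources can be computed in R as below (the counts are made-up stand-ins, not actual WHO/ECDC figures):

who_counts  <- c(100, 150, 210, 260)  # hypothetical daily confirmed cases, source A
ecdc_counts <- c(102, 148, 215, 255)  # hypothetical daily confirmed cases, source B
rmse <- sqrt(mean((who_counts - ecdc_counts)^2))  # root mean square error
rmse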
The focus of this report is to describe the statistical inference procedures used to produce design-based estimates as presented in the 2013 detailed tables, the 2013 mental health detailed tables, the 2013 national findings report, and the 2013 mental health findings report. The statistical procedures and information found in this report can also be generally applied to analyses based on the public use file as well as the restricted-use file available through the data portal. This report is organized as follows: Section 2 provides background information concerning the 2013 NSDUH; Section 3 discusses the prevalence rates and how they were calculated, including specifics on topics such as mental illness, major depressive episode, and serious psychological distress; Section 4 briefly discusses how missing item responses of variables that are not imputed may lead to biased estimates; Section 5 discusses sampling errors and how they were calculated; Section 6 describes the degrees of freedom that were used when comparing estimates; and Section 7 discusses how the statistical significance of differences between estimates was determined. Section 8 discusses confidence interval estimation, and Section 9 describes how past year incidence of drug use was computed. Finally, Section 10 discusses the conditions under which estimates with low precision were suppressed. Appendix A contains examples that demonstrate how to conduct various statistical procedures documented within this report using SAS® and SUDAAN® Software for Statistical Analysis of Correlated Data (RTI International, 2012), along with separate examples using Stata® software.
https://choosealicense.com/licenses/odbl/
MEDLINE/PubMed Baseline Statistics: Misc Report
Description
A file containing all Misc Baseline Reports for 2018-2023 in their original format is available in the Attachments section below. Annual statistical reports based upon the data elements in the baseline versions of MEDLINE®/PubMed are available. For each year covered the reports include: total citations containing each element; total occurrences of each element; minimum/average/maximum… See the full description on the dataset page: https://huggingface.co/datasets/HHS-Official/medlinepubmed-baseline-statistics-misc-report.
Learn how to produce basic estimates with the 2022 National Survey on Drug Use and Health (NSDUH). The report describes the techniques that were used to make the 2022 NSDUH Detailed Tables and the 2022 NSDUH Annual National Report, but users may also find these techniques useful for their own research with NSDUH. The report describes the calculation of estimates and sampling errors, degrees of freedom, and the procedures for determining when low-precision estimates should be suppressed. It also includes sample code in several statistical languages that data users can modify to use in their own research.

Chapters:
- Introduction to the report.
- Background on the survey design, including redesign and questionnaire changes.
- Prevalence estimates and how they were calculated, including specifics on various topics presented in the detailed tables.
- Discussion of how missing item responses of variables that are not imputed may lead to biased estimates.
- Discussion of sampling errors and how they were calculated.
- Description of degrees of freedom and how they were used to compare estimates.
- Discussion of how the statistical significance of differences between estimates was determined.
- Discussion of confidence interval estimation.
- Discussion of when estimates with low precision were suppressed.
- Appendix A contains code samples for various statistical procedures documented within the report.
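As a rough illustration of design-based estimation of this kind, the R sketch below uses the survey package with a fabricated data frame; the design variable names (VESTR, VEREP, ANALWT_C) follow NSDUH conventions, but the data and outcome are invented, and this is not the report's own sample code:

library(survey)
set.seed(1)
nsduh <- data.frame(
  VESTR    = rep(1:10, each = 10),        # hypothetical strata
  VEREP    = rep(rep(1:2, each = 5), 10), # two PSUs per stratum
  ANALWT_C = runif(100, 500, 1500),       # hypothetical analysis weights
  MJEVER   = rbinom(100, 1, 0.45)         # hypothetical 0/1 outcome
)
des <- svydesign(ids = ~VEREP, strata = ~VESTR, weights = ~ANALWT_C,
                 data = nsduh, nest = TRUE)
svymean(~MJEVER, des)  # weighted prevalence with a design-based standard error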
This is a large dataset which contains the labour market statistics data series published in the monthly Labour Market Statistics Statistical Bulletin. The dataset is overwritten every month and it therefore always contains the latest published data. The Time Series dataset facility is primarily designed for users who wish to customise their own datasets. For example, users can create a single spreadsheet including series for unemployment, claimant count, employment and workforce jobs, rather than extracting the required data from several separate spreadsheets published on the website.
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and LaTeX file are important for extracting key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus: around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
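A minimal sketch of this regular-expression approach in R, using a two-name list in place of the full ISO 3166 set:

countries <- c("Kenya", "Brazil")  # in practice, the full ISO 3166 country-name list
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")
abstracts <- c("We study maize yields in Kenya.",
               "A theoretical model of trade.")
grepl(pattern, abstracts, ignore.case = TRUE)  # TRUE only where a country name appears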
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
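The project's NER step runs spaCy from Python; the sketch below shows the equivalent idea through the spacyr R wrapper, and assumes a working spaCy installation with an English model:

library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # needs spaCy and the model installed
txt <- c(doc1 = "We analyze household survey data from Kenya and Brazil.")
parsed <- spacy_parse(txt, entity = TRUE)   # tokenize and tag named entities
ents <- entity_extract(parsed)
ents[ents$entity_type == "GPE", ]           # geopolitical entities: country mentions
spacy_finalize()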
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed: 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country.
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming $3 per article, as was paid to the MTurk workers).
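Those figures can be checked directly; the 50-year total appears to treat review time as continuous calendar time (24-hour days):

n <- 1037748                      # articles in the corpus
n * 3                             # cost at $3 per article: 3,113,244
total_min <- n * 25.4             # total review time in minutes
total_min / (60 * 24 * 365.25)    # roughly 50 years of continuous time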
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
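A sketch of that final thresholding step under the stated 90% rule, with made-up predicted probabilities and held-out labels:

pred_prob <- c(0.95, 0.40, 0.91, 0.10)  # hypothetical model confidences for 'uses data'
truth     <- c(1, 0, 1, 1)              # hypothetical held-out human labels
pred <- as.integer(pred_prob >= 0.90)   # assign 'uses data' only at >= 90% confidence
mean(pred == truth)                     # accuracy on the held-out set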
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the raters at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the San Francisco Police Department’s (SFPD) incident reports from 2018 to present. The dataset will be updated daily.
More than 500,000 rows and 34 columns. Column descriptions are listed below.
Data from DataSF.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a database (parquet format) containing publicly available multiple-cause mortality data from the US (CDC/NCHS) for 2014-2022. Not all variables are included in this export. Please see below for restrictions on the use of these data imposed by NCHS. You can use the arrow package in R to open the file. See here for an example analysis: https://github.com/DanWeinberger/pneumococcal_mortality/blob/main/analysis_nongeo.Rmd . For instance, save this file in a folder called "parquet3":
library(arrow)
library(dplyr)
pneumo.deaths.in <- open_dataset("R:/parquet3", format = "parquet") %>% # open the dataset without loading it
  filter(grepl("J13|A39|J181|A403|B953|G001", all_icd)) %>% # keep records containing the selected ICD-10 codes
  collect() # pull the filtered result into memory; do as many operations as you can before calling collect() to limit memory use
The variables included are named below (see the full dictionary: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm):
year: Calendar year of death
month: Calendar month of death
age_detail_number: a number indicating the year or part of a year; it cannot be interpreted by itself; see the agey variable instead
sex: M/F
place_of_death: Place of Death and Decedent's Status
1 ... Hospital, Clinic or Medical Center (Inpatient)
2 ... Hospital, Clinic or Medical Center (Outpatient or admitted to Emergency Room)
3 ... Hospital, Clinic or Medical Center (Dead on Arrival)
4 ... Decedent's home
5 ... Hospice facility
6 ... Nursing home/long term care
7 ... Other
9 ... Place of death unknown
all_icd: cause of death coded as ICD-10 codes; ICD1-ICD21 pasted into a single string, with codes separated by an underscore (see the parsing sketch after this list)
hisp_recode: 0=Non-Hispanic; 1=Hispanic; 999=Not specified
race_recode: race coding prior to 2018 (reconciled in race_recode_new)
race_recode_alt: race coding after 2018 (reconciled in race_recode_new)
race_recode_new:
1 = 'White'
2 = 'Black'
3 = 'Hispanic'
4 = 'American Indian'
5 = 'Asian/Pacific Islanders'
agey: age in years (or partial years for children under 12 months)
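As referenced in the all_icd entry above, the combined cause string can be split back into individual codes; the record below is invented for illustration:

all_icd_example <- "J13_A403_B953"            # hypothetical combined cause-of-death string
codes <- strsplit(all_icd_example, "_")[[1]]  # recover the individual ICD-10 codes
codes
any(codes %in% c("J13", "A39"))               # flag records mentioning selected causes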
https://www.cdc.gov/nchs/data_access/restrictions.htm
Please Read Carefully Before Using NCHS Public Use Survey Data
The National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC), conducts statistical and epidemiological activities under the authority granted by the Public Health Service Act (42 U.S.C. § 242k). NCHS survey data are protected by Federal confidentiality laws including Section 308(d) Public Health Service Act [42 U.S.C. 242m(d)] and the Confidential Information Protection and Statistical Efficiency Act or CIPSEA [Pub. L. No. 115-435, 132 Stat. 5529 § 302]. These confidentiality laws state the data collected by NCHS may be used only for statistical reporting and analysis. Any effort to determine the identity of individuals and establishments violates the assurances of confidentiality provided by federal law.
Terms and Conditions
NCHS does all it can to assure that the identity of individuals and establishments cannot be disclosed. All direct identifiers, as well as any characteristics that might lead to identification, are omitted from the dataset. Any intentional identification or disclosure of an individual or establishment violates the assurances of confidentiality given to the providers of the information. Therefore, users will:
By using these data you signify your agreement to comply with the above-stated statutorily based requirements.
Sanctions for Violating NCHS Data Use Agreement
Anyone willfully disclosing any information that could identify a person or establishment, in any manner, to a person or agency not entitled to receive it shall be guilty of a class E felony and imprisoned for not more than 5 years, fined not more than $250,000, or both.
This dataset includes all Level 3 and Level 4 searches that were conducted. In accordance with the Municipal Freedom of Information and Protection of Privacy Act, the Toronto Police Service has taken the necessary measures to protect the privacy of individuals involved in the reported occurrences. No personal information related to any of the parties involved in the occurrence will be released as open data. This data is aggregated by search year and criteria selection. There was a change in reporting effective October 2020; as a result, the type of item found during the search is no longer collected in a comparable manner, and the information now identifies only whether or not an object was found. This change has been reflected in the dataset.

General qualifiers:
- Dependent on data entered into the Booking – 3 Search of Person text template from Versadex
- Filtered by search date
- Cannot be broken down by division due to consistency issues with data entry
- May include duplicates if multiple text templates were entered for the same search
https://www.usa.gov/government-works/
A. SUMMARY San Francisco International Airport Report on Monthly Passenger Traffic Statistics by Airline.
B. HOW THE DATASET IS CREATED Data is self-reported by airlines and is only available at a monthly level.
C. UPDATE PROCESS Data is updated quarterly.
D. HOW TO USE THIS DATASET Airport data is seasonal in nature, therefore any comparative analyses should be done on a period-over-period basis (i.e. January 2010 vs. January 2009) as opposed to period-to-period (i.e. January 2010 vs. February 2010). It is also important to note that fact and attribute field relationships are not always 1-to-1. For example, Passenger Counts belonging to United Airlines will appear in multiple attribute fields and are additive, which provides flexibility for the user to derive categorical Passenger Counts as desired.
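For instance, a period-over-period comparison looks like this in R (illustrative numbers, not actual SFO counts):

jan_2009 <- 2500000                       # hypothetical January 2009 passengers
jan_2010 <- 2650000                       # hypothetical January 2010 passengers
100 * (jan_2010 - jan_2009) / jan_2009    # year-over-year change: +6%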
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Other statistics published alongside the statistical first release. These are not National Statistics, but complement the information in the main release.

FE trends
FE trends provides an overview of adult (19+) government-funded further education and all-age apprenticeships in England. It looks to provide trends between 2008/09 and 2013/14 and to give an overview of FE provision, characteristics of learners and outcomes over time.

International Comparisons Supplementary Tables
The Organisation for Economic Co-operation and Development (OECD) produces an annual publication, Education at a Glance, providing a variety of comparisons between OECD countries. The table provided here contains a summary of the relative ranking in educational attainment of the 25-64 year old population in OECD countries in 2012. The OECD statistics use the International Standard Classification of Education. Within this, “at least upper secondary education” is equivalent to holding qualifications at Level 2 or above in the UK, and “tertiary education” is equivalent to holding qualifications at Level 4 or above in the UK.

STEM
This research is the result of a Department for Business, Innovation and Skills (BIS) funded, sector-led project to gather and analyse data to inform the contribution that further education makes to STEM in England. This project was led by The Royal Academy of Engineering, and governance of the project was specifically designed to ensure that those with an interest in STEM were actively engaged and involved in directing and prioritising outputs. The November 2012 report builds on the FE and Skills STEM Data report published in July 2011 (below). It provides further analysis and interpretation of the existing data in a highly graphical format. It uses the same classified list of S, T, E and M qualifications as the 2011 report, compiled through an analysis of the Register of Regulated Qualifications and the Learning Aim Database, updated with the most recent completions and achievements data taken from the Individualised Learner Record and the National Pupil Database.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This statistical report presents a range of information on obesity, physical activity and diet, drawn together from a variety of sources. The topics covered include: overweight and obesity prevalence among adults and children; physical activity levels among adults and children; trends in purchases and consumption of food and drink and energy intake; and health outcomes of being overweight or obese.

The report contains seven chapters. Chapter 1 (Introduction) summarises government policies, targets and outcome indicators in this area, as well as providing sources of further information and links to relevant documents. Chapters 2 to 6 cover obesity, physical activity and diet, providing an overview of the key findings from these sources while maintaining useful links to each section of the source reports. Chapter 7 (Health Outcomes) presents a range of information about the health outcomes of being obese or overweight, including information on health risks, hospital admissions and prescription drugs used for the treatment of obesity.

Figures presented in this report have been obtained from a number of sources and presented in a user-friendly format. Some of the data have been published previously by the Health and Social Care Information Centre (HSCIC). Previously unpublished figures on obesity-related Finished Hospital Episodes and Finished Consultant Episodes for 2012-13 are presented using data from the HSCIC's Hospital Episode Statistics, as well as data from the Prescribing Unit at the HSCIC on prescription items dispensed for the treatment of obesity.
Uniform Appraisal Dataset (UAD) Aggregate Statistics Data File and Dashboards are the nation’s first publicly available datasets of aggregate statistics on appraisal records, giving the public new access to a broad set of data points and trends found in appraisal reports.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This statistical report acts as a reference point for health issues relating to alcohol use and misuse, providing information obtained from a number of sources in a user-friendly format. It covers topics such as drinking habits and behaviours among adults (aged 16 and over) and school children (aged 11 to 15); drinking-related ill health and mortality; affordability of alcohol; alcohol-related admissions to hospital; and alcohol-related costs.

The report contains previously published information and also includes additional new analyses. The new analyses are mainly obtained from the Health and Social Care Information Centre's (HSCIC) Hospital Episode Statistics (HES) system and prescribing data. The report also includes up-to-date information on the latest alcohol-related government policies and ambitions, and contains links to further sources of useful information.

The report used a revised methodology for estimating alcohol-related hospital admissions, following a review by Public Health England, the Department of Health and the Health and Social Care Information Centre. Consequently, estimates of alcohol-related hospital admissions for 2012-13, reported in this publication, are not comparable to estimates in earlier years' publications. A back time series of estimates of alcohol-related hospital admissions for the years 2003-04 to 2011-12, calculated using the revised methodology, was made available as additional tables on 1 October 2014. Together they provide a comparable 10-year time series from 2003-04 to 2012-13.
These statistics of state-funded school inspections in England consist of:
Official statistics are produced impartially and free from political influence.
By Health [source]
This dataset contains mortality statistics for 122 U.S. cities in 2016, providing detailed information about all deaths that occurred due to any cause, including pneumonia and influenza. The data is voluntarily reported by cities with populations of 100,000 or more, and it includes the place of death and the week during which the death certificate was filed. Data is broken down by age group and includes a flag indicating the reliability of each data set to help inform analysis. Each row also provides longitude and latitude information for each reporting area in order to make further analysis easier. These comprehensive mortality statistics are invaluable resources for tracking disease trends, as well as for making comparisons between different areas across the country in order to identify public health risks quickly and effectively.
This dataset contains mortality rates for 122 U.S. cities in 2016, including deaths by age group and cause of death. The data can be used to study various trends in mortality and contribute to the understanding of how different diseases impact different age groups across the country.
In order to use the data, one first has to identify which variables to use from this dataset. These include: reporting area; MMWR week; all causes, age greater than 65 years; all causes, age 45-64 years; all causes, age 25-44 years; all causes, age 1-24 years; all causes, less than 1 year old; pneumonia and influenza total fatalities; location (1 & 2); and a flag indicating the reliability of the data.
Once you have identified the variables that you are interested in, you will need to filter the dataset so that it only includes relevant information for your analysis or research purposes. For example, if you are looking at trends between different ages, then all you need is information on those three specific cause groups (greater than 65, 45-64, and 25-44). You can do this using a selection tool that allows you to pick only certain columns from your dataset, or an Excel filter tool if your data is stored as a CSV file.
The next step is preparing your data. This matters for efficient analysis, especially when there are too many variables or columns, which can confuse the analysis: eliminate unnecessary columns, rename column labels where needed, and so on. In addition, clean up any missing values, outliers, or incorrect entries before further investigation. Remember, outliers or corrupt entries may lead to incorrect conclusions when analyzing the data. Once the cleaning steps are complete, it is safe to move on to drawing insights.
The last step involves using statistical methods, such as linear regression with multiple predictors, or descriptive statistical measures, such as the mean and median, to draw key insights from the analysis done so far and to generate actionable points; a sketch of this step follows below.
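Here is one way that modelling step might look in R, with a fabricated weekly table standing in for the filtered mortality data:

set.seed(42)
weekly <- data.frame(
  deaths_65plus = rpois(52, 400),  # hypothetical weekly all-cause deaths, 65+
  deaths_45_64  = rpois(52, 120),  # hypothetical weekly all-cause deaths, 45-64
  pni_total     = rpois(52, 30)    # hypothetical pneumonia and influenza deaths
)
fit <- lm(pni_total ~ deaths_65plus + deaths_45_64, data = weekly)  # multiple predictors
summary(fit)$coefficients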
With these steps taken care of, it is now easier for anyone who decides to dive into another project involving this particular dataset, with the added advantage of the work done in previous investigations.
- Creating population health profiles for cities in the U.S.
- Tracking public health trends across different age groups
- Analyzing correlations between mortality and geographical locations
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors.
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: distribute your contributions under the same license as the original.
  - Keep intact all notices that refer to this license, including copyright notices.
File: rows.csv

| Column name | Description |
|:------------|:------------|
…
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This statistical report presents a range of information on smoking which is drawn together from a variety of sources. The report aims to present a broad picture of health issues relating to smoking in England and covers topics such as smoking prevalence, habits, behaviours and attitudes among adults and school children, smoking-related ill health and mortality and smoking-related costs. This report combines data from different sources presenting it in a user-friendly format. It contains data and information previously published by the Health and Social Care Information Centre (HSCIC), Department of Health, the Office for National Statistics and Her Majesty’s Revenue and Customs. The report also includes new analyses carried out by the Health and Social Care Information Centre.