CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table ("Table 1") of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.
Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.
Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database. Tests such as Tukey's rule for outlier detection and Hartigan's Dip Test for modality are computed to highlight potential issues in summarizing the data.
Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.
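As an illustration of the kind of check mentioned in the Results, Tukey's rule flags values that fall outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. The sketch below uses only the Python standard library and is not the tableone implementation; the function name `tukey_outliers`, the sample `ages` values, and the choice of the "inclusive" quartile convention are assumptions made for this example.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences.

    Fences are Q1 - k*IQR and Q3 + k*IQR. The quartile convention
    ("inclusive") is an assumption; other conventions shift the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages with one implausible entry.
ages = [34, 36, 38, 40, 41, 43, 45, 47, 120]
print(tukey_outliers(ages))
```

A flagged value is a prompt to inspect the variable, for example with a histogram, rather than an automatic reason to drop it.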
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Initial Participation (HEIP30) headline data
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and to support user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements). The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") associated item of PAD-US 3.0 Spatial Analysis and Statistics ( https://doi.org/10.5066/P9KLBB5D ) was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries.
Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip"). Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allows for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D ). Note that the PAD-US inventory is now considered functionally complete, with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/ ). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data.
While PAD-US is the official aggregation of protected areas ( https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html ), agencies are the best source of their lands data.
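The overlap prioritization described above (GAP Status Code, then public access value, then geodatabase load order) can be pictured as a sort key. The sketch below is not the released "PADUS3_0_CreateVectorAnalysisFileScript.zip" process; the `Designation` class, its field names, and the assumptions that the three criteria act as successive tie-breakers and that an earlier load order wins are all hypothetical.

```python
from dataclasses import dataclass

# Public access values in the priority order given in the description.
ACCESS_ORDER = {"Closed": 0, "Restricted": 1, "Open": 2, "Unknown": 3}


@dataclass(frozen=True)
class Designation:
    """Hypothetical record for one overlapping designation."""
    name: str
    gap_status: int  # GAP Status Code; a lower code wins (e.g. 1 over 2)
    access: str      # one of the ACCESS_ORDER keys
    load_order: int  # position in the geodatabase; earlier assumed to win


def priority_key(d):
    """Tuple compared left to right, so later fields only break ties (assumption)."""
    return (d.gap_status, ACCESS_ORDER[d.access], d.load_order)


def resolve_overlap(candidates):
    """Pick the designation that would be kept where the candidates overlap."""
    return min(candidates, key=priority_key)


forest = Designation("National Forest", gap_status=3, access="Open", load_order=1)
wilderness = Designation("Wilderness", gap_status=1, access="Open", load_order=2)
print(resolve_overlap([forest, wilderness]).name)
```

In a real workflow the same key would be applied to the attributes of intersecting polygons before the overlapping geometry is removed.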
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Full summary statistics from 41 epigenome-wide association studies (EWAS) conducted by The EWAS Catalog team (www.ewascatalog.org). Meta-data is found in the "studies-full.csv" file and the results are in "full_stats.tar.gz". Unzipping the "full_stats.tar.gz" file will reveal a folder containing 41 csv files, each with the full summary statistics from one EWAS. The results can be linked to the meta-data using the "Results_file" column in "studies-full.csv". These analyses were conducted using data extracted from the Gene Expression Omnibus (GEO). These data were extracted using the geograbi R package. For more information on the EWAS, please consult our paper: Battram, Thomas, et al. "The EWAS Catalog: A Database of Epigenome-wide Association Studies." OSF Preprints, 4 Feb. 2021. https://doi.org/10.31219/osf.io/837wn. Please cite the paper if you use this dataset.
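Linking results to meta-data through the "Results_file" column can be sketched with the Python standard library. This is only an illustration: it assumes each "Results_file" value equals the base file name of a member inside "full_stats.tar.gz", and the helper names and the "Trait" column in the usage lines are hypothetical.

```python
import csv
import io
import tarfile
from pathlib import PurePosixPath


def index_metadata(csv_text):
    """Map each Results_file value to its meta-data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Results_file"]: row for row in reader}


def match_members(member_names, meta_index):
    """Pair archive member names with meta-data rows by base file name."""
    pairs = {}
    for name in member_names:
        key = PurePosixPath(name).name
        if key in meta_index:
            pairs[name] = meta_index[key]
    return pairs


def link_archive(archive_path, studies_csv_path):
    """Read both files from disk and return {member name: meta-data row}."""
    with open(studies_csv_path, newline="", encoding="utf-8") as fh:
        meta_index = index_metadata(fh.read())
    with tarfile.open(archive_path, "r:gz") as tar:
        return match_members(tar.getnames(), meta_index)
```

For example, `link_archive("full_stats.tar.gz", "studies-full.csv")` would return one entry per results file that has a matching meta-data row; members without a match are skipped so they can be inspected separately.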
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CSV file containing summary statistics of proteins in association with incident CAD from logistic regression after adjusting for demographics, fasting status, glycemic status, BMI, and HbA1c.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.
This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.
Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.
We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.
Thank you for supporting research and development in the field of natural language processing!
This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.
Imports:
- numpy (np): Numerical operations library, though it's not used in this script.
- pandas (pd): Data manipulation and analysis library.
- os: For interacting with the operating system, e.g., building file paths.
- glob: For file pattern matching and retrieving file paths.
Function: get_texts
- text_folders: List of folders containing news article text files.
- text_list: List to store the content of text files.
- summ_folder: List of folders containing summary text files.
- sum_list: List to store the content of summary files.
- encodings: List of encodings to try for reading files.
- text_list and sum_list.
Data Preparation:
- text_folder: List of directories for news articles.
- summ_folder: List of directories for summaries.
- text_list and summ_list: Initialize empty lists to store the contents.
- data_df: Empty DataFrame to store the final data.
Execution:
- Calls the get_texts function to populate text_list and summ_list.
- Populates data_df with columns 'Text' and 'Summary'.
- Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
Output:
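The steps described for this script can be sketched as follows. This is not the original script: it swaps pandas for the standard-library csv module so that it runs without extra dependencies, and the `.txt` pattern, the pairing of an article with the summary of the same file name, the default encodings, and the helper names `read_with_fallback` and `save_csv` are assumptions.

```python
import csv
import glob
import os


def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the file content using the first encoding that decodes it."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                return fh.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")


def get_texts(text_folders, summ_folders, encodings=("utf-8", "latin-1")):
    """Collect article and summary texts from parallel lists of folders."""
    text_list, summ_list = [], []
    for text_dir, summ_dir in zip(text_folders, summ_folders):
        for text_path in sorted(glob.glob(os.path.join(text_dir, "*.txt"))):
            # Assumption: a summary shares its article's file name.
            summ_path = os.path.join(summ_dir, os.path.basename(text_path))
            if not os.path.exists(summ_path):
                continue
            text_list.append(read_with_fallback(text_path, encodings))
            summ_list.append(read_with_fallback(summ_path, encodings))
    return text_list, summ_list


def save_csv(text_list, summ_list, out_path):
    """Write the paired texts to a CSV with 'Text' and 'Summary' columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Text", "Summary"])
        writer.writerows(zip(text_list, summ_list))
```

With pandas available, the last step would instead build a DataFrame from the two lists and call its `to_csv` method, which is closer to what the description outlines.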
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and outdoor recreation access across the nation. This data release presents results from statistical summaries of the PAD-US 3.0 protection status (by GAP Status Code) and public access status for various land unit boundaries (Protected Areas Database of the United States 3.0 Vector Analysis and Summary Statistics). Summary statistics are also available to explore and download (Comma-separated Table [CSV], Microsoft Excel Workbook [.xlsx], Portable Document Format [.pdf] Report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). The vector GIS analysis file, source data used to summarize statistics for areas of interest to stakeholders (National, State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative), and complete Summary Statistics Tabular Data (CSV) are included in this data release. Raster GIS analysis files are also available for combination with other raster data (Protected Areas Database of the United States (PAD-US) 3.0 Raster Analysis). The PAD-US 3.0 Combined Fee, Designation, Easement feature class in the full inventory, with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class (Protected Areas Database of the United States (PAD-US) 3.0, https://doi.org/10.5066/P9Q9LQ4B), was modified to prioritize and remove overlapping management designations, limiting overestimation in protection status or public access statistics and to support user needs for vector and raster analysis data.
Analysis files in this data release were clipped to the Census State boundary file to define the extent and fill in areas (largely private land) outside the PAD-US, providing a common denominator for statistical summaries.
The purpose of this USGS data release is to publish NC SELDM streamflow statistics and summary statistics of physical and chemical data in support of the information provided in the above-referenced report. This data release consists of two data sets, "NC SELDM streamflow statistics..." and "NC SELDM summary statistics for physical and chemical data...". The tables that are uploaded for the "NC SELDM streamflow statistics for 266 streamgages across North Carolina" sub-section are primarily the support files for the StreamStatsDB update that was completed when the report was approved. These files were generated using the GNWISQ and QSTATS computer programs developed and described by Granato (2009, appendices 1 and 4). This is discussed near the end of the "Prestorm streamflow statistics" section in the above-referenced report. A large table of selected site attributes and StreamStats basin characteristics that were compiled for the 266 streamgages is also provided as a part of this data release. A ReadMe file is also included in the sub-section of the data release. The tables that are uploaded for the "NC SELDM summary statistics for physical and chemical data at NC highway-runoff and bridge-deck sites" sub-section of the data release support the statewide medians table (Table 7) discussed within the "Simulating highway-runoff quality" section in the above-referenced report. This is a .csv file for each of the 11 constituents referenced in Table 11. Descriptions of the data fields (or columns) in the .csv tables are provided at the top of each .csv file. A ReadMe file is also included in the sub-section of the data release.
The release presents the mean, median, lower quartile and upper quartile total monthly rent paid, for a number of bedroom/room categories. This covers each local authority in England, for the 12 months to the end of September 2015.
For further details on the information included in this release, including a glossary of terms and a variable list for the CSV format files, please refer to the statistical summary.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.
The key columns in the dataset are as follows:
In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.
By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.
Data that was used to train the SVM. As the train-test data were assigned randomly for every training iteration, the individual data used for generating the subfigures b–e are not separately listed, as these cannot be manually recreated but depend on the train-test assignment by the algorithm. (ZIP)
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
HE Participation by age 25 (CHEP-25) key stats
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Full summary statistics from 387 epigenome-wide association studies (EWAS) conducted by The EWAS Catalog team (http://www.ewascatalog.org/). Meta-data is found in the "studies-full.csv" file and the results are in "full_stats.tar.gz". Unzipping the "full_stats.tar.gz" file will reveal a folder containing 387 csv files, each with the full summary statistics from one EWAS. The results can be linked to the meta-data using the "Results_file" column in "studies-full.csv". These analyses were conducted using data from the Accessible Resource for Integrated Epigenomics Studies (ARIES) subset of the Avon Longitudinal Study of Parents and Children (ALSPAC) cohort. For more information on the EWAS, please consult our paper: Battram, Thomas, et al. "The EWAS Catalog: A Database of Epigenome-wide Association Studies." OSF Preprints, 4 Feb. 2021. https://doi.org/10.31219/osf.io/837wn. Please cite the paper if you use the dataset.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Summary statistics of figures supplied by public sector bodies (up to and including 17 November 2021) covering the four years of the target (period covering 1 April 2017 to 31 March 2021).
Reporting periods: 2017-18 to 2020-21
Indicators:
- Apprentices (prior to period, new in period, at end of period, cumulative since April 2017)
- Employees (prior to period, new in period, at end of period, cumulative since April 2017)
- Apprenticeship percentage (prior to period, new in period, at end of period, cumulative since April 2017)
- Number of employers in the period
- Percentage of employees starting apprenticeships in period
Filters: Subsector
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data for the top 10 scorers in the NBA from the years 2000-2024. Scraped using nba_api.
leaders.csv - - General season statistics for each season's top 10 scorers
shotsXXXXs.csv - - Shot details for every made shot from each season's top 10 scorers
shots2000s.csv - - Data from 2000-01 season through 2009-10 season
shots2010s.csv - - Data from 2010-11 season through 2019-20 season
shots2020s.csv - - Data from 2020-21 season through 2023-24 season
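Given the file layout listed above, a small helper can pick the shot file that covers a season. This is a sketch only: the function name is hypothetical, and it assumes a season is identified by the calendar year in which it starts (2009 for the 2009-10 season).

```python
def shots_file_for_season(start_year):
    """Return the shot-detail file name covering the season starting in start_year."""
    # The listing covers the 2000-01 season through the 2023-24 season.
    if not 2000 <= start_year <= 2023:
        raise ValueError(f"no shot file listed for a season starting in {start_year}")
    decade = start_year // 10 * 10
    return f"shots{decade}s.csv"


print(shots_file_for_season(2009))
```

Season-level totals for the same players would come from leaders.csv rather than from the shot files.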
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This file contains a summary of the headline, national level figures for this publication.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? 
(2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains results of a genome-wide association study of distinct chronic musculoskeletal pain conditions: back pain, knee pain, neck pain, and hip pain. Additionally, there are genome-wide association summary statistics for four genetically independent components of pain conditions, listed above. For more details, please, read the paper XXX.
All files contain association summary statistics for genome-wide association meta-analysis of the 265,000 white British individuals from the UK Biobank and additional 191,580 individuals of European Ancestry from the UK biobank (total N = 456,580). Cases and controls were defined based on questionnaire responses. First, participants responded to “Pain type(s) experienced in the last months” followed by questions inquiring if the specific pain had been present for more than 3 months. Those who reported back, neck or shoulder, hip, or knee pain lasting more than 3 months were considered chronic back, neck/shoulder, hip, and knee pain cases, respectively. Participants reporting no such pain lasting longer than 3 months were considered controls (regardless of whether they had another regional chronic pain, such as abdominal pain, or not). Individuals who preferred not to answer were excluded from the study. Besides this, we excluded individuals who reported more than 3 months of pain all over the body.
The data are provided on an "AS-IS" basis, without warranty of any type, expressed or implied, including but not limited to any warranty as to their performance, merchantability, or fitness for any particular purpose. If investigators use these data, any and all consequences are entirely their responsibility. By downloading and using these data, you agree that you will cite the appropriate publication in any communications or publications arising directly or indirectly from these data; for utilization of data available prior to publication, you agree to respect the requested responsibilities of resource users under 2003 Fort Lauderdale principles; you agree that you will never attempt to identify any participant. This research has been conducted using the UK Biobank Resource and the use of the data is guided by the principles formulated by the UK Biobank.
When using downloaded data, please cite the corresponding paper and this repository:
Funding:
The work of YSA and SZS was supported by the Russian Ministry of Education and Science under the 5-100 Excellence Programme and by the Federal Agency of Scientific Organizations via the Institute of Cytology and Genetics (project 0324-2019-0040). The work of YAT, ASSh, and EEE was supported by the Russian Foundation for Basic Research (project 19-015-00151). The contribution of LСK was funded by PolyOmica. Dr. Suri was supported by VA Career Development Award # 1IK2RX001515 from the United States (U.S.) Department of Veterans Affairs Rehabilitation Research and Development (RR&D) Service. Dr. Suri is a Staff Physician at the VA Puget Sound Health Care System. The contents of this work do not represent the views of the U.S. Department of Veterans Affairs or the United States Government.
List of files:
Column headers:
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Headline statistics used in the blue summary boxes of the publication.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Key summary statistics
- Explore Education Statistics data set Headline Stats from Widening participation in higher education