100+ datasets found
  1. Harvard Common Data Set

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Office of Institutional Research (2023). Harvard Common Data Set [Dataset]. http://doi.org/10.7910/DVN/AOD2ZV
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Office of Institutional Research
    Description

    This represents Harvard's responses to the Common Data Set initiative. The Common Data Set (CDS) initiative is a collaborative effort among data providers in the higher education community and publishers as represented by the College Board, Peterson's, and U.S. News & World Report. The combined goal of this collaboration is to improve the quality and accuracy of information provided to all involved in a student's transition into higher education, as well as to reduce the reporting burden on data providers. This goal is attained by the development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item. Data items and definitions used by the U.S. Department of Education in its higher education surveys often serve as a guide in the continued development of the CDS. Common Data Set items undergo broad review by the CDS Advisory Board as well as by data providers representing secondary schools and two- and four-year colleges. Feedback from those who utilize the CDS is also considered throughout the annual review process.

  2. The USyd Campus Dataset

    • ieee-dataport.org
    Updated May 18, 2022
    Cite
    Wei Zhou (2022). The USyd Campus Dataset [Dataset]. https://ieee-dataport.org/open-access/usyd-campus-dataset
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Wei Zhou
    Description

    navigation and deep-learning applications. Despite this success

  3. C4 Dataset

    • paperswithcode.com
    Updated Dec 13, 2023
    + more versions
    Cite
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu (2023). C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
    Explore at:
    Dataset updated
    Dec 13, 2023
    Authors
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

    The dataset can be downloaded in a pre-processed form from AllenNLP.
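
    As a quick illustration, here is a minimal sketch of streaming a few C4 examples with the Hugging Face `datasets` library, assuming the AllenAI-hosted mirror (`allenai/c4`) and its `en` configuration:

    ```python
    # Stream C4 instead of downloading the full multi-terabyte corpus.
    # Assumes the "allenai/c4" Hugging Face mirror with the "en" config.
    from datasets import load_dataset

    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for i, example in enumerate(c4):
        print(example["url"])           # source URL of the crawled page
        print(example["text"][:200])    # first 200 characters of cleaned text
        if i >= 2:
            break
    ```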

  4. Global Roads Open Access Data Set, Version 1 (gROADSv1)

    • data.nasa.gov
    • datasets.ai
    • +4 more
    Updated Apr 23, 2025
    + more versions
    Cite
    nasa.gov (2025). Global Roads Open Access Data Set, Version 1 (gROADSv1) [Dataset]. https://data.nasa.gov/dataset/global-roads-open-access-data-set-version-1-groadsv1
    Explore at:
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
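
    For users working with the data programmatically, a minimal GeoPandas sketch follows; the shapefile name is a placeholder, not the official distribution name:

    ```python
    # Load the gROADSv1 roads layer after downloading and unzipping it locally.
    # "groads_v1.shp" is an illustrative file name.
    import geopandas as gpd

    roads = gpd.read_file("groads_v1.shp")
    print(roads.crs)     # coordinate reference system of the layer
    print(roads.head())  # first few road segments and their attributes
    ```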

  5. An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets

    • figshare.com
    csv
    Updated Jan 23, 2025
    Cite
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg (2025). An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets [Dataset]. http://doi.org/10.6084/m9.figshare.26509150.v7
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    figshare
    Authors
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Please see our GitHub repository: https://github.com/BIH-CEI/rd-cdm/ and our RD CDM documentation: https://rd-cdm.readthedocs.io/en/latest/index.html. Attention: the RD CDM paper is currently under review (version 2.0.0.dev0). As soon as the paper is accepted, we will publish v2.0.0. For more information, please see our ChangeLog: https://rd-cdm.readthedocs.io/en/latest/changelog.html

    We introduce our RD CDM v2.0.0, a common data model specifically designed for rare diseases. This RD CDM simplifies the capture, storage, and exchange of complex clinical data, enabling researchers and healthcare providers to work with harmonized datasets across different institutions and countries. The RD CDM is based on the ERDRI-CDS, a common data set developed by the European Rare Disease Research Infrastructure (ERDRI) to support the collection of harmonized data for rare disease research. By extending the ERDRI-CDS with additional concepts and relationships based on HL7 FHIR v4.0.1 and the GA4GH Phenopacket Schema v2.0, the RD CDM provides a comprehensive model for capturing detailed clinical information alongside precise genetic data on rare diseases.

    Background: Rare diseases (RDs), though individually rare, collectively impact over 260 million people worldwide, with over 17 million affected in Europe. These conditions, defined by their low prevalence of fewer than 5 in 10,000 individuals, are often genetically driven, with over 70% of cases suspected to have a genetic cause. Despite significant advances in medical research, RD patients still face lengthy diagnostic delays, often due to a lack of awareness in general healthcare settings and the rarity of RD-specific knowledge among clinicians. Misdiagnosis and underrepresentation in routine care further compound the challenges, leaving many patients without timely and accurate diagnoses. Interoperability plays a critical role in addressing these challenges, ensuring the seamless exchange and interpretation of medical data through the use of internationally agreed standards. In the field of rare diseases, where data is often scarce and scattered, the importance of structured, standardized, and reusable medical records cannot be overstated. Interoperable data formats allow for more efficient research, better care coordination, and a clearer understanding of complex clinical cases. However, existing medical systems often fail to support the depth of phenotypic and genotypic data required for rare disease research and treatment, making interoperability a crucial enabler for improving outcomes in RD care.
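
    To make the data structures concrete, here is a simplified, hand-written example of the kind of record the GA4GH Phenopacket Schema v2.0 describes, expressed as a plain Python dict; it illustrates the structure only and is not output of the RD CDM tooling (the ORPHA code below is a placeholder):

    ```python
    import json

    # Minimal phenopacket-like record. HP:0001250 is the HPO term for
    # "Seizure"; the ORPHA entry is an illustrative placeholder.
    phenopacket = {
        "id": "example-phenopacket-1",
        "subject": {"id": "patient-1", "sex": "FEMALE"},
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}}
        ],
        "diseases": [
            {"term": {"id": "ORPHA:XXXXX", "label": "illustrative rare disease"}}
        ],
        "metaData": {"phenopacketSchemaVersion": "2.0"},
    }

    print(json.dumps(phenopacket, indent=2))
    ```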

  6. common_voice_16_0

    • huggingface.co
    Updated Dec 22, 2023
    + more versions
    Cite
    Mozilla Foundation (2023). common_voice_16_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0
    Explore at:
    Dataset updated
    Dec 22, 2023
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 16

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 30328 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 19673 validated hours in 120 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0.
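
    A minimal sketch of streaming this corpus with the Hugging Face `datasets` library; the dataset is gated, so this assumes you have accepted the terms on the dataset page and authenticated with `huggingface-cli login`:

    ```python
    from datasets import load_dataset

    # Stream instead of downloading the full corpus; "en" selects English.
    cv = load_dataset(
        "mozilla-foundation/common_voice_16_0",
        "en",
        split="train",
        streaming=True,
    )

    sample = next(iter(cv))
    print(sample["sentence"])                # transcript text
    print(sample["audio"]["sampling_rate"])  # decoded MP3 sampling rate
    ```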

  7. Forest Inventory and Analysis Database

    • data-usfs.hub.arcgis.com
    • datadiscoverystudio.org
    • +9 more
    Updated Apr 14, 2017
    + more versions
    Cite
    U.S. Forest Service (2017). Forest Inventory and Analysis Database [Dataset]. https://data-usfs.hub.arcgis.com/documents/usfs::forest-inventory-and-analysis-database
    Explore at:
    Dataset updated
    Apr 14, 2017
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    U.S. Forest Service
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Description

    The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and depletion of timber on the Nation's forest land. Before 1999, all inventories were conducted on a periodic basis. The passage of the 1998 Farm Bill requires FIA to collect data annually on plots within each State. This kind of up-to-date information is essential to frame realistic forest policies and programs. Summary reports for individual States are published but the Forest Service also provides data collected in each inventory to those interested in further analysis. Data is distributed via the FIA DataMart in a standard format. This standard format, referred to as the Forest Inventory and Analysis Database (FIADB) structure, was developed to provide users with as much data as possible in a consistent manner among States. A number of inventories conducted prior to the implementation of the annual inventory are available in the FIADB. However, various data attributes may be empty or the items may have been collected or computed differently. Annual inventories use a common plot design and common data collection procedures nationwide, resulting in greater consistency among FIA work units than earlier inventories. Links to field collection manuals and the FIADB user's manual are provided in the FIA DataMart.
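
    A minimal pandas sketch of reading one state's TREE table from an FIA DataMart CSV export; the file name and column meanings (SPCD = species code, DIA = diameter, STATUSCD = tree status) are assumptions to verify against the FIADB user's manual:

    ```python
    import pandas as pd

    # Rhode Island TREE table as an example download from the FIA DataMart.
    trees = pd.read_csv("RI_TREE.csv", usecols=["INVYR", "SPCD", "DIA", "STATUSCD"])

    # Mean diameter of live trees (STATUSCD == 1 is assumed to mean "live").
    live = trees[trees["STATUSCD"] == 1]
    print(live.groupby("INVYR")["DIA"].mean())
    ```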

  8. NATCOOP dataset

    • heidata.uni-heidelberg.de
    csv, docx, pdf, tsv +1
    Updated Jan 27, 2022
    Cite
    Florian Diekert; Florian Diekert; Robbert-Jan Schaap; Robbert-Jan Schaap; Tillmann Eymess; Tillmann Eymess (2022). NATCOOP dataset [Dataset]. http://doi.org/10.11588/DATA/GV8NBL
    Explore at:
    Available download formats: csv, docx, pdf, tsv, type/x-r-syntax (38 files; individual file sizes are listed on the dataset page)
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    heiDATA
    Authors
    Florian Diekert; Florian Diekert; Robbert-Jan Schaap; Robbert-Jan Schaap; Tillmann Eymess; Tillmann Eymess
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/GV8NBL

    Time period covered
    Jan 1, 2017 - Jan 1, 2021
    Dataset funded by
    European Commission
    Description

    The NATCOOP project set out to study how nature shapes the preferences and incentives of economic agents and how this in turn affects common-pool resource management. Imagine a group of fishermen targeting a species that requires a lot of teamwork to harvest. Do these fishers become more social over time compared to fishers that work in a more solitary manner? If so, does this have implications for how the fishery should be managed? To study this, the NATCOOP team travelled to Chile and Tanzania and collected data using surveys and economic experiments. These two very different countries have a large population of small-scale fishermen, and both host several distinct types of fisheries. Over the course of five field trips, the project team surveyed more than 2500 fishermen, with each field trip contributing to the main research question by measuring fishermen’s preferences for cooperation and risk. Additionally, each field trip aimed to answer another smaller research question focused on either risk-taking or cooperation behaviour in the fisheries. The data from both surveys and experiments are now publicly available and can be freely studied by other researchers, resource managers, or interested citizens. Overall, the NATCOOP dataset contains participants’ responses to a plethora of survey questions and their actions during incentivized economic experiments. It is available in both the .dta and .csv formats, and its use is recommended with statistical software such as R or Stata. For those unfamiliar with statistical analysis, we included a video tutorial on how to use the data set in the open-source program R.
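
    A minimal sketch of loading the data with pandas, which reads both distributed formats; the file name is a placeholder for the actual name in the repository:

    ```python
    import pandas as pd

    # The Stata file preserves variable and value labels.
    df = pd.read_stata("natcoop_survey.dta")
    # Equivalent CSV load:
    # df = pd.read_csv("natcoop_survey.csv")

    print(df.shape)         # respondents x variables
    print(df.columns[:10])  # first few survey variables
    ```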

  9. common_voice_11_0

    • huggingface.co
    Updated Nov 3, 2022
    + more versions
    Cite
    Mozilla Foundation (2022). common_voice_11_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
    Explore at:
    Dataset updated
    Nov 3, 2022
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 11.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
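
    Since most speech-recognition models expect 16 kHz input, here is a minimal sketch of resampling the 48 kHz MP3s on the fly with the `datasets` Audio feature (again assuming the gated-dataset terms have been accepted and you are logged in):

    ```python
    from datasets import load_dataset, Audio

    cv = load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "en",
        split="validation",
        streaming=True,
    )

    # Decode audio at 16 kHz instead of the native 48 kHz.
    cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

    sample = next(iter(cv))
    print(sample["audio"]["sampling_rate"])  # 16000
    print(sample["sentence"])
    ```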

  10. Data from: Training dataset from the Da Vinci Research Kit

    • data.niaid.nih.gov
    • portaldelainvestigacion.uma.es
    • +1 more
    Updated Sep 21, 2022
    Cite
    Giuseppe Tortora (2022). Training dataset from the Da Vinci Research Kit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3830937
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Giuseppe Tortora
    Andrea Mariani
    Carlos Pérez-del-Pulgar
    Irene Rivas-Blanco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets are gaining relevance in surgical robotics, since they can be used to recognise and automate tasks in the lab. A common data set also allows different algorithms and methods to be compared. The objective of this work is to provide a complete data set of several training tasks that surgeons perform to improve their skills. For this purpose, the da Vinci Research Kit has been used to perform different training tasks. The obtained data set includes all the information provided by the da Vinci robot together with the corresponding video from the camera. Kinematic data has been collected at 50 frames per second, and images at 15 frames per second. All the information has been carefully timestamped and provided in a readable csv format. The application used to retrieve the information from the da Vinci Research Kit, as well as tools to access the information, are also provided.
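
    Because the two streams run at different rates (50 Hz kinematics, 15 fps video), a common first step is aligning them on their timestamps. A minimal pandas sketch follows; the file and column names are placeholders for those documented with the dataset:

    ```python
    import pandas as pd

    kin = pd.read_csv("kinematics.csv")       # e.g. timestamp, joint positions, ...
    frames = pd.read_csv("video_frames.csv")  # e.g. timestamp, frame_id

    # merge_asof requires both tables sorted on the join key.
    kin = kin.sort_values("timestamp")
    frames = frames.sort_values("timestamp")

    # For each video frame, take the most recent kinematic sample.
    aligned = pd.merge_asof(frames, kin, on="timestamp", direction="backward")
    print(aligned.head())
    ```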

  11. Common Voice Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Jan 7, 2021
    Cite
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber (2021). Common Voice Dataset [Dataset]. https://paperswithcode.com/dataset/common-voice
    Explore at:
    Dataset updated
    Jan 7, 2021
    Authors
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber
    Description

    Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

  12. Synthea synthetic patient generator data in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 4, 2023
    Cite
    Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
    Explore at:
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    The Synthea-generated data is provided here as 1,000 person (1k), 100,000 person (100k), and 2,800,000 person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079
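
    A minimal sketch of querying the extract once the tables are available locally as CSVs; PERSON and CONDITION_OCCURRENCE are standard OMOP CDM tables, while the file names are placeholders for however the extract is laid out:

    ```python
    import pandas as pd

    person = pd.read_csv("person.csv")
    conditions = pd.read_csv("condition_occurrence.csv")

    # Join on the OMOP person_id key and summarise conditions per person.
    merged = conditions.merge(person, on="person_id")
    print(merged.groupby("person_id")["condition_concept_id"].count().describe())
    ```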

  13. US Public Schools

    • data.smartidf.services
    • public.opendatasoft.com
    csv, excel, geojson +1
    Updated Jan 6, 2023
    + more versions
    Cite
    (2023). US Public Schools [Dataset]. https://data.smartidf.services/explore/dataset/us-public-schools/
    Explore at:
    Available download formats: geojson, excel, json, csv
    Dataset updated
    Jan 6, 2023
    License

    Public domain (https://en.wikipedia.org/wiki/Public_domain)

    Area covered
    United States
    Description

    This Public Schools feature dataset is composed of all Public elementary and secondary education facilities in the United States as defined by the Common Core of Data (CCD, https://nces.ed.gov/ccd/ ), National Center for Education Statistics (NCES, https://nces.ed.gov ), US Department of Education for the 2017-2018 school year. This includes all Kindergarten through 12th grade schools as tracked by the Common Core of Data. Included in this dataset are military schools in US territories and referenced in the city field with an APO or FPO address. DOD schools represented in the NCES data that are outside of the United States or US territories have been omitted. This feature class contains all MEDS/MEDS+ as approved by NGA. Complete field and attribute information is available in the ”Entities and Attributes” metadata section. Geographical coverage is depicted in the thumbnail above and detailed in the Place Keyword section of the metadata. This release includes the addition of 3065 new records, modifications to the spatial location and/or attribution of 99,287 records, and removal of 2996 records not present in the NCES CCD data.
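
    A minimal GeoPandas sketch of loading the GeoJSON export and filtering by state; the file name and the state field name are assumptions to check against the "Entities and Attributes" metadata:

    ```python
    import geopandas as gpd

    schools = gpd.read_file("us-public-schools.geojson")
    print(len(schools))  # total number of school points

    # Filter to one state (field name assumed).
    ma = schools[schools["state"] == "MA"]
    print(ma.head())
    ```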

  14. Basic Safety Message Data Emulator

    • catalog.data.gov
    • data.transportation.gov
    • +3 more
    Updated Mar 16, 2025
    Cite
    US Department of Transportation (2025). Basic Safety Message Data Emulator [Dataset]. https://catalog.data.gov/dataset/basic-safety-message-data-emulator
    Explore at:
    Dataset updated
    Mar 16, 2025
    Dataset provided by
    US Department of Transportation
    Description

    The Trajectory Conversion Algorithm Version 2.3 (TCA) is designed to test different strategies for producing, transmitting, and storing Connected Vehicle information. The TCA uses vehicle trajectory data, roadside equipment (RSE) location information, cellular region information and strategy information to emulate the messages connected vehicles would produce. This data set contains common data sets generated by the TCA using the BSM and PDM at 100% market penetration for two simulated traffic networks, an arterial network (Van Ness Avenue in San Francisco, CA) and a freeway network (the interchange of I-270 and I-44 in St. Louis, MO). This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.) and the USDOT will consider modifying the dataset to fully integrate in data.transportation.gov.

  15. DECOVID: Data derived from UCLH and UHB during the COVID pandemic

    • healthdatagateway.org
    unknown
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158), DECOVID: Data derived from UCLH and UHB during the COVID pandemic [Dataset]. https://healthdatagateway.org/dataset/998
    Explore at:
    Available download formats: unknown
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    DECOVID, a multi-centre research consortium, was founded in March 2020 by two United Kingdom (UK) National Health Service (NHS) Foundation Trusts (comprising three acute care hospitals) and three research institutes/universities: University Hospitals Birmingham (UHB), University College London Hospitals (UCLH), University of Birmingham, University College London and The Alan Turing Institute. The original aim of DECOVID was to share harmonised electronic health record (EHR) data from UCLH and UHB to enable researchers affiliated with the DECOVID consortium to answer clinical questions to support the COVID-19 response. The DECOVID database has now been placed within the infrastructure of PIONEER, a Health Data Research (HDR) UK funded data hub that contains data from acute care providers, to make the DECOVID database accessible to external researchers not affiliated with the DECOVID consortium.

    This highly granular dataset contains 256,804 spells and 165,414 hospitalised patients. The data includes demographics, serial physiological measurements, laboratory test results, medications, procedures, drugs, mortality and readmission.

    Geography: UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UCLH provides first-class acute and specialist services in six hospitals in central London, seeing more than 1 million outpatient visits and 100,000 admissions per year. Both UHB and UCLH have fully electronic health records. Data has been harmonised using the OMOP data model. Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in other common data models and can build synthetic data to meet bespoke requirements.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  16. common lit external dataset 2021

    • kaggle.com
    Updated Aug 1, 2021
    Cite
    Sayantan Kirtaniya (2021). common lit external dataset 2021 [Dataset]. https://www.kaggle.com/sayantankirtaniya/newone/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sayantan Kirtaniya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is an external dataset that is useful for the CommonLit Readability Prize. The dataset contains five CSV files and one .npy file.

    Content

    1. all_data.csv: This dataset has been created from the OneStopEnglishCorpus. For making this, we have considered the elementary data. All pre-processing and cleaning has already been done on this dataset.

      • This dataset has three columns: Elementary, Intermediate and Advanced

        • Elementary: This column contains the dataset of the elementary school level.
        • Intermediate: This column contains the dataset of the intermediate schooling level.
        • Advanced: This column contains the dataset of the advanced schooling level.
    2. children_books.csv: This dataset has been created from Highly Rated Children Books and Stories. It is the cleaned and pre-processed part 1 of that dataset. Columns: Title, Author, Desc, Interest_Rate, Reading_age.

    3. children_stories.csv: This dataset has been created from Highly Rated Children Books and Stories. It is the cleaned and pre-processed part 2 of that dataset. Columns: names, cats, desc.

    4. corpus.csv: This dataset has been created from the GitHub repository of TovlyDeutsch. The data available there is unorganised and raw; we have organised, cleaned, and pre-processed it properly.

    5. Fullset.csv: This dataset is the parent set of all the data mentioned here; all the others are subsets of it. We merged all four datasets after cleaning and pre-processing them, so this full dataset is the final one that can be used to calculate readability scores. It has a total of 27283 unique data points. Column: corpus.

    6. Fullset.npy: This .npy file contains the dataset as a list; anyone who wants to add or remove data can use this file to do so easily and efficiently.
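
    A minimal sketch of loading the files described above, assuming they sit in the working directory under the names given here:

    ```python
    import numpy as np
    import pandas as pd

    full = pd.read_csv("Fullset.csv")  # parent set, 27283 unique data points
    print(full.shape)

    books = pd.read_csv("children_books.csv")
    print(books[["Title", "Author", "Reading_age"]].head())

    # The .npy companion holds the corpus as a Python list, so the array
    # has dtype=object and needs allow_pickle=True to load.
    corpus = np.load("Fullset.npy", allow_pickle=True)
    print(len(corpus))
    ```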

  17. common_voice_17_0

    • huggingface.co
    + more versions
    Cite
    Mozilla Foundation, common_voice_17_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
    Explore at:
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 17.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 31175 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 20408 validated hours in 124 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0.

  18. CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 18, 2023
    Cite
    CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/cmsdesynpuf-omop/
    Explore at:
    Dataset updated
    Jan 18, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    DE-SynPUF is provided here as 1,000 person (1k), 100,000 person (100k), and 2,300,000 person (2.3m) data sets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:

    1. allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
    2. train researchers on the use and complexity of conducting analyses with CMS claims data prior to initiating the process to obtain access to actual CMS data; and,
    3. support safe data mining innovations that may reveal unanticipated knowledge gains while preserving beneficiary privacy.

    The files have been designed so that programs and procedures created on the DE-SynPUF will function on CMS Limited Data Sets. The data structure of the Medicare DE-SynPUF is very similar to the CMS Limited Data Sets, but with a smaller number of variables. The DE-SynPUF also provides a robust set of metadata on the CMS claims data that have not been previously available in the public domain. Although the DE-SynPUF has very limited inferential research value to draw conclusions about Medicare beneficiaries due to the synthetic processes used to create the file, the Medicare DE-SynPUF does increase access to a realistic Medicare claims data file in a timely and less expensive manner to spur the innovation necessary to achieve the goals of better care for beneficiaries and improve the health of the population.

  19. YouTube Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jan 9, 2023
    Cite
    Bright Data (2023). YouTube Datasets [Dataset]. https://brightdata.com/products/datasets/youtube
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jan 9, 2023
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    YouTube, Worldwide
    Description

    Use our YouTube profiles dataset to extract both business and non-business information from public channels and filter by channel name, views, creation date, or subscribers. Datapoints include URL, handle, banner image, profile image, name, subscribers, description, video count, create date, views, details, and more. You may purchase the entire dataset or a customized subset, depending on your needs. Popular use cases for this dataset include sentiment analysis, brand monitoring, influencer marketing, and more.
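
    A minimal pandas sketch of filtering a purchased export; the file name and the column names used here are illustrative stand-ins for the datapoints listed above:

    ```python
    import pandas as pd

    channels = pd.read_csv("youtube_profiles.csv")

    # Channels with at least 1M subscribers, newest first (column names assumed).
    big = (
        channels[channels["subscribers"] >= 1_000_000]
        .sort_values("created_date", ascending=False)
    )
    print(big[["name", "subscribers", "created_date"]].head())
    ```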

  20. Data from: Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach

    • figshare.com
    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl (2023). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach [Dataset]. http://doi.org/10.1021/acs.jcim.7b00249.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between them to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
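
    To make the idea concrete, here is a simplified stand-in for the approach using RDKit and scikit-learn: hashed Morgan fingerprint fragments act as "words" and molecules as "documents" for standard LDA. This is an illustration of the technique, not a reproduction of the authors' CheTo implementation:

    ```python
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.decomposition import LatentDirichletAllocation

    smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Molecule -> bag of fragment counts (radius-2 Morgan bits, 1024 "words").
    n_bits = 1024
    counts = np.zeros((len(mols), n_bits), dtype=int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=n_bits)
        for bit, count in fp.GetNonzeroElements().items():
            counts[i, bit] = count

    # Fit a tiny topic model over the fragment "vocabulary".
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)
    print(doc_topics)  # per-molecule topic distribution
    ```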
