100+ datasets found
  1. Harvard Common Data Set

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Office of Institutional Research (2023). Harvard Common Data Set [Dataset]. http://doi.org/10.7910/DVN/AOD2ZV
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Office of Institutional Research
    Description

    This represents Harvard's responses to the Common Data Set initiative. The Common Data Set (CDS) initiative is a collaborative effort among data providers in the higher education community and publishers as represented by the College Board, Peterson's, and U.S. News & World Report. The combined goal of this collaboration is to improve the quality and accuracy of information provided to all involved in a student's transition into higher education, as well as to reduce the reporting burden on data providers. This goal is attained by the development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item. Data items and definitions used by the U.S. Department of Education in its higher education surveys often serve as a guide in the continued development of the CDS. Common Data Set items undergo broad review by the CDS Advisory Board as well as by data providers representing secondary schools and two- and four-year colleges. Feedback from those who utilize the CDS is also considered throughout the annual review process.

  2. The USyd Campus Dataset

    • ieee-dataport.org
    Updated May 18, 2022
    Cite
    Wei Zhou (2022). The USyd Campus Dataset [Dataset]. https://ieee-dataport.org/open-access/usyd-campus-dataset
    Explore at:
    Dataset updated
    May 18, 2022
    Authors
    Wei Zhou
    Description

    navigation and deep-learning applications. Despite this success

  3. C4 Dataset

    • paperswithcode.com
    Updated Dec 13, 2023
    + more versions
    Cite
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu (2023). C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
    Explore at:
    Dataset updated
    Dec 13, 2023
    Authors
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

    The dataset can be downloaded in a pre-processed form from AllenNLP.
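
    As a quick illustration, here is a minimal sketch of streaming a few C4 examples with the Hugging Face `datasets` library, assuming the AllenAI-hosted mirror (`allenai/c4`) and its `en` configuration:

    ```python
    # Stream C4 instead of downloading the full multi-terabyte corpus.
    # Assumes the "allenai/c4" Hugging Face mirror with the "en" config.
    from datasets import load_dataset

    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for i, example in enumerate(c4):
        print(example["url"])           # source URL of the crawled page
        print(example["text"][:200])    # first 200 characters of cleaned text
        if i >= 2:
            break
    ```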

  4. Global Roads Open Access Data Set, Version 1 (gROADSv1)

    • data.nasa.gov
    • datasets.ai
    • +4 more
    Updated Apr 23, 2025
    + more versions
    Cite
    nasa.gov (2025). Global Roads Open Access Data Set, Version 1 (gROADSv1) [Dataset]. https://data.nasa.gov/dataset/global-roads-open-access-data-set-version-1-groadsv1
    Explore at:
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
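
    For users working with the data programmatically, a minimal GeoPandas sketch follows; the shapefile name is a placeholder, not the official distribution name:

    ```python
    # Load the gROADSv1 roads layer after downloading and unzipping it locally.
    # "groads_v1.shp" is an illustrative file name.
    import geopandas as gpd

    roads = gpd.read_file("groads_v1.shp")
    print(roads.crs)     # coordinate reference system of the layer
    print(roads.head())  # first few road segments and their attributes
    ```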

  5. An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets

    • figshare.com
    csv
    Updated Jan 23, 2025
    Cite
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg (2025). An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets [Dataset]. http://doi.org/10.6084/m9.figshare.26509150.v7
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    figshare
    Authors
    Adam S.L. Graefe; Sophie AI Klopfenstein; Daniel Danis; Peter N. Robinson; Jana Zschüntzsch; Susanna Wiegand; Peter Kühnen; Oya Beyan; Sylvia Thun; Elisabeth Félicité Nyoungui; Filip Rehburg
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Please see our GitHub repository: https://github.com/BIH-CEI/rd-cdm/ and our RD CDM documentation: https://rd-cdm.readthedocs.io/en/latest/index.html. Attention: the RD CDM paper is currently under review (version 2.0.0.dev0). As soon as the paper is accepted, we will publish v2.0.0. For more information, please see our ChangeLog: https://rd-cdm.readthedocs.io/en/latest/changelog.html

    We introduce our RD CDM v2.0.0, a common data model specifically designed for rare diseases. This RD CDM simplifies the capture, storage, and exchange of complex clinical data, enabling researchers and healthcare providers to work with harmonized datasets across different institutions and countries. The RD CDM is based on the ERDRI-CDS, a common data set developed by the European Rare Disease Research Infrastructure (ERDRI) to support the collection of harmonized data for rare disease research. By extending the ERDRI-CDS with additional concepts and relationships based on HL7 FHIR v4.0.1 and the GA4GH Phenopacket Schema v2.0, the RD CDM provides a comprehensive model for capturing detailed clinical information alongside precise genetic data on rare diseases.

    Background: Rare diseases (RDs), though individually rare, collectively impact over 260 million people worldwide, with over 17 million affected in Europe. These conditions, defined by their low prevalence of fewer than 5 in 10,000 individuals, are often genetically driven, with over 70% of cases suspected to have a genetic cause. Despite significant advances in medical research, RD patients still face lengthy diagnostic delays, often due to a lack of awareness in general healthcare settings and the rarity of RD-specific knowledge among clinicians. Misdiagnosis and underrepresentation in routine care further compound the challenges, leaving many patients without timely and accurate diagnoses. Interoperability plays a critical role in addressing these challenges, ensuring the seamless exchange and interpretation of medical data through the use of internationally agreed standards. In the field of rare diseases, where data is often scarce and scattered, the importance of structured, standardized, and reusable medical records cannot be overstated. Interoperable data formats allow for more efficient research, better care coordination, and a clearer understanding of complex clinical cases. However, existing medical systems often fail to support the depth of phenotypic and genotypic data required for rare disease research and treatment, making interoperability a crucial enabler for improving outcomes in RD care.
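
    To make the data structures concrete, here is a simplified, hand-written example of the kind of record the GA4GH Phenopacket Schema v2.0 describes, expressed as a plain Python dict; it illustrates the structure only and is not output of the RD CDM tooling (the ORPHA code below is a placeholder):

    ```python
    import json

    # Minimal phenopacket-like record. HP:0001250 is the HPO term for
    # "Seizure"; the ORPHA entry is an illustrative placeholder.
    phenopacket = {
        "id": "example-phenopacket-1",
        "subject": {"id": "patient-1", "sex": "FEMALE"},
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}}
        ],
        "diseases": [
            {"term": {"id": "ORPHA:XXXXX", "label": "illustrative rare disease"}}
        ],
        "metaData": {"phenopacketSchemaVersion": "2.0"},
    }

    print(json.dumps(phenopacket, indent=2))
    ```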

  6. common_voice_16_0

    • huggingface.co
    Updated Dec 22, 2023
    + more versions
    Cite
    Mozilla Foundation (2023). common_voice_16_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0
    Explore at:
    Dataset updated
    Dec 22, 2023
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 16

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 30328 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 19673 validated hours in 120 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0.
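
    A minimal sketch of streaming this corpus with the Hugging Face `datasets` library; the dataset is gated, so this assumes you have accepted the terms on the dataset page and authenticated with `huggingface-cli login`:

    ```python
    from datasets import load_dataset

    # Stream instead of downloading the full corpus; "en" selects English.
    cv = load_dataset(
        "mozilla-foundation/common_voice_16_0",
        "en",
        split="train",
        streaming=True,
    )

    sample = next(iter(cv))
    print(sample["sentence"])                # transcript text
    print(sample["audio"]["sampling_rate"])  # decoded MP3 sampling rate
    ```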

  7. Forest Inventory and Analysis Database

    • data-usfs.hub.arcgis.com
    • datadiscoverystudio.org
    • +9 more
    Updated Apr 14, 2017
    + more versions
    Cite
    U.S. Forest Service (2017). Forest Inventory and Analysis Database [Dataset]. https://data-usfs.hub.arcgis.com/documents/usfs::forest-inventory-and-analysis-database
    Explore at:
    Dataset updated
    Apr 14, 2017
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    U.S. Forest Service
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Description

    The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA's primary objective is to determine the extent, condition, volume, growth, and depletion of timber on the Nation's forest land. Before 1999, all inventories were conducted on a periodic basis. The passage of the 1998 Farm Bill requires FIA to collect data annually on plots within each State. This kind of up-to-date information is essential to frame realistic forest policies and programs. Summary reports for individual States are published but the Forest Service also provides data collected in each inventory to those interested in further analysis. Data is distributed via the FIA DataMart in a standard format. This standard format, referred to as the Forest Inventory and Analysis Database (FIADB) structure, was developed to provide users with as much data as possible in a consistent manner among States. A number of inventories conducted prior to the implementation of the annual inventory are available in the FIADB. However, various data attributes may be empty or the items may have been collected or computed differently. Annual inventories use a common plot design and common data collection procedures nationwide, resulting in greater consistency among FIA work units than earlier inventories. Links to field collection manuals and the FIADB user's manual are provided in the FIA DataMart.
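
    A minimal pandas sketch of reading one state's TREE table from an FIA DataMart CSV export; the file name and column meanings (SPCD = species code, DIA = diameter, STATUSCD = tree status) are assumptions to verify against the FIADB user's manual:

    ```python
    import pandas as pd

    # Rhode Island TREE table as an example download from the FIA DataMart.
    trees = pd.read_csv("RI_TREE.csv", usecols=["INVYR", "SPCD", "DIA", "STATUSCD"])

    # Mean diameter of live trees (STATUSCD == 1 is assumed to mean "live").
    live = trees[trees["STATUSCD"] == 1]
    print(live.groupby("INVYR")["DIA"].mean())
    ```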

  8. NATCOOP dataset

    • heidata.uni-heidelberg.de
    csv, docx, pdf, tsv +1
    Updated Jan 27, 2022
    Cite
    Florian Diekert; Florian Diekert; Robbert-Jan Schaap; Robbert-Jan Schaap; Tillmann Eymess; Tillmann Eymess (2022). NATCOOP dataset [Dataset]. http://doi.org/10.11588/DATA/GV8NBL
    Explore at:
    Available download formats: csv, docx, pdf, tsv, type/x-r-syntax (38 files; individual file sizes are listed on the dataset page)
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    heiDATA
    Authors
    Florian Diekert; Florian Diekert; Robbert-Jan Schaap; Robbert-Jan Schaap; Tillmann Eymess; Tillmann Eymess
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/GV8NBL

    Time period covered
    Jan 1, 2017 - Jan 1, 2021
    Dataset funded by
    European Commission
    Description

    The NATCOOP project set out to study how nature shapes the preferences and incentives of economic agents and how this in turn affects common-pool resource management. Imagine a group of fishermen targeting a species that requires a lot of teamwork to harvest. Do these fishers become more social over time compared to fishers that work in a more solitary manner? If so, does this have implications for how the fishery should be managed? To study this, the NATCOOP team travelled to Chile and Tanzania and collected data using surveys and economic experiments. These two very different countries have a large population of small-scale fishermen, and both host several distinct types of fisheries. Over the course of five field trips, the project team surveyed more than 2500 fishermen, with each field trip contributing to the main research question by measuring fishermen’s preferences for cooperation and risk. Additionally, each field trip aimed to answer another smaller research question focused on either risk-taking or cooperation behaviour in the fisheries. The data from both surveys and experiments are now publicly available and can be freely studied by other researchers, resource managers, or interested citizens. Overall, the NATCOOP dataset contains participants’ responses to a plethora of survey questions and their actions during incentivized economic experiments. It is available in both the .dta and .csv formats, and its use is recommended with statistical software such as R or Stata. For those unfamiliar with statistical analysis, we included a video tutorial on how to use the data set in the open-source program R.
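
    A minimal sketch of loading the data with pandas, which reads both distributed formats; the file name is a placeholder for the actual name in the repository:

    ```python
    import pandas as pd

    # The Stata file preserves variable and value labels.
    df = pd.read_stata("natcoop_survey.dta")
    # Equivalent CSV load:
    # df = pd.read_csv("natcoop_survey.csv")

    print(df.shape)         # respondents x variables
    print(df.columns[:10])  # first few survey variables
    ```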

  9. common_voice_11_0

    • huggingface.co
    Updated Nov 3, 2022
    + more versions
    Cite
    Mozilla Foundation (2022). common_voice_11_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
    Explore at:
    Dataset updated
    Nov 3, 2022
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 11.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
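
    Since most speech-recognition models expect 16 kHz input, here is a minimal sketch of resampling the 48 kHz MP3s on the fly with the `datasets` Audio feature (again assuming the gated-dataset terms have been accepted and you are logged in):

    ```python
    from datasets import load_dataset, Audio

    cv = load_dataset(
        "mozilla-foundation/common_voice_11_0",
        "en",
        split="validation",
        streaming=True,
    )

    # Decode audio at 16 kHz instead of the native 48 kHz.
    cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

    sample = next(iter(cv))
    print(sample["audio"]["sampling_rate"])  # 16000
    print(sample["sentence"])
    ```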

  10. Data from: Training dataset from the Da Vinci Research Kit

    • data.niaid.nih.gov
    • portaldelainvestigacion.uma.es
    • +1 more
    Updated Sep 21, 2022
    Cite
    Giuseppe Tortora (2022). Training dataset from the Da Vinci Research Kit [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3830937
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Giuseppe Tortora
    Andrea Mariani
    Carlos Pérez-del-Pulgar
    Irene Rivas-Blanco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data sets are gaining relevance in surgical robotics, since they can be used to recognise and automate tasks in the lab. A common data set also allows different algorithms and methods to be compared. The objective of this work is to provide a complete data set of several training tasks that surgeons perform to improve their skills. For this purpose, the da Vinci Research Kit has been used to perform different training tasks. The obtained data set includes all the information provided by the da Vinci robot together with the corresponding video from the camera. Kinematic data has been collected at 50 frames per second, and images at 15 frames per second. All the information has been carefully timestamped and provided in a readable csv format. The application used to retrieve the information from the da Vinci Research Kit, as well as tools to access the information, are also provided.
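
    Because the two streams run at different rates (50 Hz kinematics, 15 fps video), a common first step is aligning them on their timestamps. A minimal pandas sketch follows; the file and column names are placeholders for those documented with the dataset:

    ```python
    import pandas as pd

    kin = pd.read_csv("kinematics.csv")       # e.g. timestamp, joint positions, ...
    frames = pd.read_csv("video_frames.csv")  # e.g. timestamp, frame_id

    # merge_asof requires both tables sorted on the join key.
    kin = kin.sort_values("timestamp")
    frames = frames.sort_values("timestamp")

    # For each video frame, take the most recent kinematic sample.
    aligned = pd.merge_asof(frames, kin, on="timestamp", direction="backward")
    print(aligned.head())
    ```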

  11. Common Voice Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Jan 7, 2021
    Cite
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber (2021). Common Voice Dataset [Dataset]. https://paperswithcode.com/dataset/common-voice
    Explore at:
    Dataset updated
    Jan 7, 2021
    Authors
    Rosana Ardila; Megan Branson; Kelly Davis; Michael Henretty; Michael Kohler; Josh Meyer; Reuben Morais; Lindsay Saunders; Francis M. Tyers; Gregor Weber
    Description

    Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.

  12. Synthea synthetic patient generator data in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 4, 2023
    Cite
    Amazon Web Services (2023). Synthea synthetic patient generator data in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/synthea-omop/
    Explore at:
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    The Synthea-generated data is provided here as 1,000 person (1k), 100,000 person (100k), and 2,800,000 person (2.8m) data sets in the OMOP Common Data Model format. Synthea™ is a synthetic patient generator that models the medical history of synthetic patients. Our mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions. It can be used without restriction for a variety of secondary uses in academia, research, industry, and government (although a citation would be appreciated). You can read our first academic paper here: https://doi.org/10.1093/jamia/ocx079
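
    A minimal sketch of querying the extract once the tables are available locally as CSVs; PERSON and CONDITION_OCCURRENCE are standard OMOP CDM tables, while the file names are placeholders for however the extract is laid out:

    ```python
    import pandas as pd

    person = pd.read_csv("person.csv")
    conditions = pd.read_csv("condition_occurrence.csv")

    # Join on the OMOP person_id key and summarise conditions per person.
    merged = conditions.merge(person, on="person_id")
    print(merged.groupby("person_id")["condition_concept_id"].count().describe())
    ```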

  13. US Public Schools

    • data.smartidf.services
    • public.opendatasoft.com
    csv, excel, geojson +1
    Updated Jan 6, 2023
    + more versions
    Cite
    (2023). US Public Schools [Dataset]. https://data.smartidf.services/explore/dataset/us-public-schools/
    Explore at:
    Available download formats: geojson, excel, json, csv
    Dataset updated
    Jan 6, 2023
    License

    Public domain (https://en.wikipedia.org/wiki/Public_domain)

    Area covered
    United States
    Description

    This Public Schools feature dataset is composed of all Public elementary and secondary education facilities in the United States as defined by the Common Core of Data (CCD, https://nces.ed.gov/ccd/ ), National Center for Education Statistics (NCES, https://nces.ed.gov ), US Department of Education for the 2017-2018 school year. This includes all Kindergarten through 12th grade schools as tracked by the Common Core of Data. Included in this dataset are military schools in US territories and referenced in the city field with an APO or FPO address. DOD schools represented in the NCES data that are outside of the United States or US territories have been omitted. This feature class contains all MEDS/MEDS+ as approved by NGA. Complete field and attribute information is available in the ”Entities and Attributes” metadata section. Geographical coverage is depicted in the thumbnail above and detailed in the Place Keyword section of the metadata. This release includes the addition of 3065 new records, modifications to the spatial location and/or attribution of 99,287 records, and removal of 2996 records not present in the NCES CCD data.
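
    A minimal GeoPandas sketch of loading the GeoJSON export and filtering by state; the file name and the state field name are assumptions to check against the "Entities and Attributes" metadata:

    ```python
    import geopandas as gpd

    schools = gpd.read_file("us-public-schools.geojson")
    print(len(schools))  # total number of school points

    # Filter to one state (field name assumed).
    ma = schools[schools["state"] == "MA"]
    print(ma.head())
    ```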

  14. Basic Safety Message Data Emulator

    • catalog.data.gov
    • data.transportation.gov
    • +3 more
    Updated Mar 16, 2025
    Cite
    US Department of Transportation (2025). Basic Safety Message Data Emulator [Dataset]. https://catalog.data.gov/dataset/basic-safety-message-data-emulator
    Explore at:
    Dataset updated
    Mar 16, 2025
    Dataset provided by
    US Department of Transportation
    Description

    The Trajectory Conversion Algorithm Version 2.3 (TCA) is designed to test different strategies for producing, transmitting, and storing Connected Vehicle information. The TCA uses vehicle trajectory data, roadside equipment (RSE) location information, cellular region information and strategy information to emulate the messages connected vehicles would produce. This data set contains common data sets generated by the TCA using the BSM and PDM at 100% market penetration for two simulated traffic networks, an arterial network (Van Ness Avenue in San Francisco, CA) and a freeway network (the interchange of I-270 and I-44 in St. Louis, MO). This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.) and the USDOT will consider modifying the dataset to fully integrate in data.transportation.gov.

  15. DECOVID: Data derived from UCLH and UHB during the COVID pandemic

    • healthdatagateway.org
    unknown
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158), DECOVID: Data derived from UCLH and UHB during the COVID pandemic [Dataset]. https://healthdatagateway.org/dataset/998
    Explore at:
    Available download formats: unknown
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    DECOVID, a multi-centre research consortium, was founded in March 2020 by two United Kingdom (UK) National Health Service (NHS) Foundation Trusts (comprising three acute care hospitals) and three research institutes/universities: University Hospitals Birmingham (UHB), University College London Hospitals (UCLH), University of Birmingham, University College London and The Alan Turing Institute. The original aim of DECOVID was to share harmonised electronic health record (EHR) data from UCLH and UHB to enable researchers affiliated with the DECOVID consortium to answer clinical questions to support the COVID-19 response. The DECOVID database has now been placed within the infrastructure of PIONEER, a Health Data Research (HDR) UK funded data hub that contains data from acute care providers, to make the DECOVID database accessible to external researchers not affiliated with the DECOVID consortium.

    This highly granular dataset contains 256,804 spells and 165,414 hospitalised patients. The data includes demographics, serial physiological measurements, laboratory test results, medications, procedures, drugs, mortality and readmission.

    Geography: UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UCLH provides first-class acute and specialist services in six hospitals in central London, seeing more than 1 million outpatient visits and 100,000 admissions per year. Both UHB and UCLH have fully electronic health records. Data has been harmonised using the OMOP data model. Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in other common data models and can build synthetic data to meet bespoke requirements.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  16. common lit external dataset 2021

    • kaggle.com
    Updated Aug 1, 2021
    Cite
    Sayantan Kirtaniya (2021). common lit external dataset 2021 [Dataset]. https://www.kaggle.com/sayantankirtaniya/newone/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sayantan Kirtaniya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is an external dataset that is useful for the CommonLit Readability Prize. The dataset contains five CSV files and one .npy file.

    Content

    1. all_data.csv: This dataset has been created from the OneStopEnglishCorpus. For making this, we have considered the elementary data. All pre-processing and cleaning has already been done on this dataset.

      • This dataset has three columns: Elementary, Intermediate and Advanced

        • Elementary: This column contains the dataset of the elementary school level.
        • Intermediate: This column contains the dataset of the intermediate schooling level.
        • Advanced: This column contains the dataset of the advanced schooling level.
    2. children_books.csv: This dataset has been created from Highly Rated Children Books and Stories. It is the cleaned and pre-processed part 1 of that dataset. Columns: Title, Author, Desc, Interest_Rate, Reading_age.

    3. children_stories.csv: This dataset has been created from Highly Rated Children Books and Stories. It is the cleaned and pre-processed part 2 of that dataset. Columns: names, cats, desc.

    4. corpus.csv: This dataset has been created from the GitHub repository of TovlyDeutsch. The data available there is unorganised and raw; we have organised, cleaned, and pre-processed it properly.

    5. Fullset.csv: This dataset is the parent set of all the data mentioned here; all the others are subsets of it. We merged all four datasets after cleaning and pre-processing them, so this full dataset is the final one that can be used to calculate readability scores. It has a total of 27283 unique data points. Column: corpus.

    6. Fullset.npy: This .npy file contains the dataset as a list; anyone who wants to add or remove data can use this file to do so easily and efficiently.
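
    A minimal sketch of loading the files described above, assuming they sit in the working directory under the names given here:

    ```python
    import numpy as np
    import pandas as pd

    full = pd.read_csv("Fullset.csv")  # parent set, 27283 unique data points
    print(full.shape)

    books = pd.read_csv("children_books.csv")
    print(books[["Title", "Author", "Reading_age"]].head())

    # The .npy companion holds the corpus as a Python list, so the array
    # has dtype=object and needs allow_pickle=True to load.
    corpus = np.load("Fullset.npy", allow_pickle=True)
    print(len(corpus))
    ```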

  17. common_voice_17_0

    • huggingface.co
    + more versions
    Cite
    Mozilla Foundation, common_voice_17_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
    Explore at:
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 17.0

      Dataset Summary
    

    The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 31175 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 20408 validated hours in 124 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0.

  18. CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

    • registry.opendata.aws
    Updated Jan 18, 2023
    Cite
    CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model [Dataset]. https://registry.opendata.aws/cmsdesynpuf-omop/
    Explore at:
    Dataset updated
    Jan 18, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    DE-SynPUF is provided here as 1,000 person (1k), 100,000 person (100k), and 2,300,000 person (2.3m) data sets in the OMOP Common Data Model format. The DE-SynPUF was created with the goal of providing a realistic set of claims data in the public domain while providing the very highest degree of protection to the Medicare beneficiaries’ protected health information. The purposes of the DE-SynPUF are to:

    1. allow data entrepreneurs to develop and create software and applications that may eventually be applied to actual CMS claims data;
    2. train researchers on the use and complexity of conducting analyses with CMS claims data prior to initiating the process to obtain access to actual CMS data; and,
    3. support safe data mining innovations that may reveal unanticipated knowledge gains while preserving beneficiary privacy.

    The files have been designed so that programs and procedures created on the DE-SynPUF will function on CMS Limited Data Sets. The data structure of the Medicare DE-SynPUF is very similar to the CMS Limited Data Sets, but with a smaller number of variables. The DE-SynPUF also provides a robust set of metadata on the CMS claims data that have not been previously available in the public domain. Although the DE-SynPUF has very limited inferential research value to draw conclusions about Medicare beneficiaries due to the synthetic processes used to create the file, the Medicare DE-SynPUF does increase access to a realistic Medicare claims data file in a timely and less expensive manner to spur the innovation necessary to achieve the goals of better care for beneficiaries and improve the health of the population.

  19. YouTube Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jan 9, 2023
    Cite
    Bright Data (2023). YouTube Datasets [Dataset]. https://brightdata.com/products/datasets/youtube
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jan 9, 2023
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    YouTube, Worldwide
    Description

    Use our YouTube profiles dataset to extract both business and non-business information from public channels and filter by channel name, views, creation date, or subscribers. Datapoints include URL, handle, banner image, profile image, name, subscribers, description, video count, create date, views, details, and more. You may purchase the entire dataset or a customized subset, depending on your needs. Popular use cases for this dataset include sentiment analysis, brand monitoring, influencer marketing, and more.
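
    A minimal pandas sketch of filtering a purchased export; the file name and the column names used here are illustrative stand-ins for the datapoints listed above:

    ```python
    import pandas as pd

    channels = pd.read_csv("youtube_profiles.csv")

    # Channels with at least 1M subscribers, newest first (column names assumed).
    big = (
        channels[channels["subscribers"] >= 1_000_000]
        .sort_values("created_date", ascending=False)
    )
    print(big[["name", "subscribers", "created_date"]].head())
    ```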

  20. Data from: Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach

    • figshare.com
    • acs.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl (2023). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach [Dataset]. http://doi.org/10.1021/acs.jcim.7b00249.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between them to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
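
    To make the idea concrete, here is a simplified stand-in for the approach using RDKit and scikit-learn: hashed Morgan fingerprint fragments act as "words" and molecules as "documents" for standard LDA. This is an illustration of the technique, not a reproduction of the authors' CheTo implementation:

    ```python
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.decomposition import LatentDirichletAllocation

    smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Molecule -> bag of fragment counts (radius-2 Morgan bits, 1024 "words").
    n_bits = 1024
    counts = np.zeros((len(mols), n_bits), dtype=int)
    for i, mol in enumerate(mols):
        fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=n_bits)
        for bit, count in fp.GetNonzeroElements().items():
            counts[i, bit] = count

    # Fit a tiny topic model over the fragment "vocabulary".
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)
    print(doc_topics)  # per-molecule topic distribution
    ```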
