100+ datasets found
  1. FStarDataSet-V2

    • huggingface.co
    Updated Sep 4, 2024
    Cite
    Microsoft (2024). FStarDataSet-V2 [Dataset]. https://huggingface.co/datasets/microsoft/FStarDataSet-V2
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 4, 2024
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    This dataset is the Version 2.0 of microsoft/FStarDataSet.

    Primary-Objective

    This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of an AI model is to synthesize the implementation (see below for details about the usage of this dataset, including the input and output).

    Data Format

    Each of the examples in this dataset is organized as a dictionary… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.

  2. SECs Compiled Financial Statements & Notes Dataset

    • kaggle.com
    Updated Jul 31, 2024
    Cite
    Deny Tran (2024). SECs Compiled Financial Statements & Notes Dataset [Dataset]. https://www.kaggle.com/datasets/denytran/im-a-dataset
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deny Tran
    License

    https://www.usa.gov/government-works/

    Description

    This dataset is from the SEC's Financial Statements and Notes Data Set.
    It was a personal project to see if I could make the queries efficient.
    It's just been collecting dust ever since, maybe someone will make good use of it.
    Data is up to about early-2024.
    It doesn't differ from the source, other than it's compiled - so maybe you can try it out, then compile your own (with the link below).
    Dataset was created using SEC Files and SQL Server on Docker.
    For details on the SQL Server database this came from, see: "dataset-previous-life-info" folder, which will contain:

    • Row Counts
    • Primary/Foreign Keys
    • SQL Statements to recreate database tables
    • Example queries on how to join the data tables
    • A pretty picture of the table associations

    Source: https://www.sec.gov/data-research/financial-statement-notes-data-sets

    Happy coding!

  3. OpenFEMA Data Set Fields

    • catalog.data.gov
    • datasets.ai
    Updated Jun 7, 2025
    Cite
    FEMA/Mission Support/Off of Chf Information Officer (2025). OpenFEMA Data Set Fields [Dataset]. https://catalog.data.gov/dataset/openfema-data-set-fields
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    FEMA/Mission Support/Off of Chf Information Officer
    Description

    Metadata for the OpenFEMA API data set fields. It contains descriptions, data types, and other attributes for each field.

    If you have media inquiries about this dataset please email the FEMA News Desk FEMA-News-Desk@dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open government program please contact the OpenFEMA team via email OpenFEMA@fema.dhs.gov.

  4. Dataset - Understanding the software and data used in the social sciences

    • zenodo.org
    • eprints.soton.ac.uk
    pdf, zip
    Updated Jul 12, 2024
    Cite
    Selina Aragon; Mario Antonioletti; Johanna Walker; Neil Chue Hong (2024). Dataset - Understanding the software and data used in the social sciences [Dataset]. http://doi.org/10.5281/zenodo.7785711
    Explore at:
    zip, pdf (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Selina Aragon; Mario Antonioletti; Johanna Walker; Neil Chue Hong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a repository for a UKRI Economic and Social Research Council (ESRC) funded project to understand the software used to analyse social sciences data.

    Any software produced has been made available under a BSD 2-Clause license and any data and other non-software derivative is made available under a CC-BY 4.0 International License. Note that the software that analysed the survey is provided for illustrative purposes - it will not work on the decoupled anonymised data set.

    Exceptions to this are:

    Contents

    • Survey data & analysis: esrc_data-survey-analysis-data.zip
    • Other data: esrc_data-other-data.zip
    • Transcripts: esrc_data-transcripts.zip
    • Data Management Plan: esrc_data-dmp.zip

    Survey data & analysis

    The survey ran from 3rd February 2022 to 6th March 2023 during which 168 responses were received. Of these responses, three were removed because they were supplied by people from outside the UK without a clear indication of involvement with the UK or associated infrastructure. A fourth response was removed because it duplicated another response from the same person, which leaves 164 responses in the data.

    The survey responses, Questions (Q) Q1-Q16, have been decoupled from the demographic data, Q17-Q23. Questions Q24-Q28 are for follow-up and have been removed from the data. The institutions (Q17) and funding sources (Q18) have been provided in separate files as these could be used to identify respondents. Q17, Q18 and Q19-Q23 have all been independently shuffled.

    The data has been made available as Comma Separated Values (CSV) with the question number as the header of each column and the encoded responses in the column below. To see what the question and the responses correspond to you will have to consult the survey-results-key.csv which decodes the question and responses accordingly.

    A pdf copy of the survey questions is available on GitHub.

    The survey data has been decoupled into:

    • survey-results-key.csv - maps a question number and the responses to the actual question values.
    • q1-16-survey-results.csv - the non-demographic component of the survey responses (Q1-Q16).
    • q19-23-demographics.csv - the demographic part of the survey (Q19-Q21, Q23).
    • q17-institutions.csv - the institution/location of the respondent (Q17).
    • q18-funding.csv - funding sources within the last 5 years (Q18).
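
    As a minimal sketch of the decoding step described above, the following Python snippet maps encoded responses back to their labels through a key file. The actual layout of survey-results-key.csv is not described here, so the column names (question, code, label) and all sample values are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical key file: one row per (question, encoded response) pair.
# The real survey-results-key.csv may use a different layout.
KEY_CSV = """question,code,label
Q1,1,Yes
Q1,2,No
Q2,1,Python
Q2,2,R
"""

# Hypothetical encoded responses, with the question number as column header.
RESULTS_CSV = """Q1,Q2
1,2
2,1
"""


def load_key(text):
    """Build a {(question, code): label} lookup from the key file."""
    return {
        (row["question"], row["code"]): row["label"]
        for row in csv.DictReader(io.StringIO(text))
    }


def decode(results_text, key):
    """Replace each encoded response with its label; unknown codes are kept as-is."""
    decoded = []
    for row in csv.DictReader(io.StringIO(results_text)):
        decoded.append({q: key.get((q, code), code) for q, code in row.items()})
    return decoded


key = load_key(KEY_CSV)
decoded_rows = decode(RESULTS_CSV, key)
print(decoded_rows)
```

    With the real files, the two in-memory strings would be replaced by the contents of survey-results-key.csv and q1-16-survey-results.csv, and the lookup adjusted to whatever columns the key file actually uses.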

    Please note the code that has been used to do the analysis will not run with the decoupled survey data.

    Other data files included

    • CleanedLocations.csv - normalised version of the institutions that the survey respondents volunteered.
    • DTPs.csv - information on the UKRI Doctoral Training Partnerships (DTPs) scraped from the UKRI DTP contacts web page in October 2021.
    • projectsearch-1646403729132.csv.gz - data snapshot from the UKRI Gateway to Research released on the 24th February 2022 made available under an Open Government Licence.
    • locations.csv - latitude and longitude for the institutions in the cleaned locations.
    • subjects.csv - research classifications for the ESRC projects for the 24th February data snapshot.
    • topics.csv - topic classification for the ESRC projects for the 24th February data snapshot.

    Interview transcripts

    The interview transcripts have been anonymised and converted to markdown so that they are easier to process in general. List of interview transcripts:

    • 1269794877.md
    • 1578450175.md
    • 1792505583.md
    • 2964377624.md
    • 3270614512.md
    • 40983347262.md
    • 4288358080.md
    • 4561769548.md
    • 4938919540.md
    • 5037840428.md
    • 5766299900.md
    • 5996360861.md
    • 6422621713.md
    • 6776362537.md
    • 7183719943.md
    • 7227322280.md
    • 7336263536.md
    • 75909371872.md
    • 7869268779.md
    • 8031500357.md
    • 9253010492.md

    Data Management Plan

    The study's Data Management Plan is provided in PDF format and shows the different data sets used throughout the duration of the study and where they have been deposited, as well as how long the SSI will keep these records.

  5. KaggleDBQA Dataset

    • paperswithcode.com
    Updated Jan 20, 2025
    Cite
    Chia-Hsuan Lee; Oleksandr Polozov; Matthew Richardson (2025). KaggleDBQA Dataset [Dataset]. https://paperswithcode.com/dataset/kaggledbqa
    Dataset updated
    Jan 20, 2025
    Authors
    Chia-Hsuan Lee; Oleksandr Polozov; Matthew Richardson
    Description

    KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.

    It expands upon contemporary cross-domain text-to-SQL datasets in three key aspects: (1) Its databases are pulled from real-world data sources and not normalized. (2) Its questions are authored in environments that mimic natural question answering. (3) It also provides database documentation that contains rich in-domain knowledge.

  6. Data from: Data Sets for Evaluation of Building Fault Detection and...

    • osti.gov
    • data.openei.org
    • +1 more
    Updated Feb 26, 2019
    Cite
    Lin, Guanjing; Mitchell, Robin (2019). Data Sets for Evaluation of Building Fault Detection and Diagnostics Algorithms [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1824861-data-sets-evaluation-building-fault-detection-diagnostics-algorithms
    Dataset updated
    Feb 26, 2019
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    49.2637,-66.5318|24.5873,-66.5318|24.5873,-125.4514|49.2637,-125.4514|49.2637,-66.5318
    DOE Open Energy Data Initiative (OEDI); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
    Authors
    Lin, Guanjing; Mitchell, Robin
    Description

    This documentation and dataset can be used to test the performance of automated fault detection and diagnostics algorithms for buildings. The dataset was created by LBNL, PNNL, NREL, ORNL and ASHRAE RP-1312 (Drexel University). It includes data for air-handling units and rooftop units simulated with PNNL's large office building model.

  7. AI-Generated-vs-Real-Images-Datasets

    • huggingface.co
    Updated Mar 27, 2025
    Cite
    Hem Bahadur Gurung (2025). AI-Generated-vs-Real-Images-Datasets [Dataset]. https://huggingface.co/datasets/Hemg/AI-Generated-vs-Real-Images-Datasets
    Dataset updated
    Mar 27, 2025
    Authors
    Hem Bahadur Gurung
    Description

    Dataset Card for "AI-Generated-vs-Real-Images-Datasets"

    More Information needed

  8. High School Heights Dataset

    • kaggle.com
    Updated Aug 11, 2022
    Cite
    Yashmeet Singh (2022). High School Heights Dataset [Dataset]. https://www.kaggle.com/datasets/yashmeetsingh/high-school-heights-dataset
    Dataset updated
    Aug 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yashmeet Singh
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    High School Heights Dataset

    You will find three datasets containing heights of the high school students.

    All heights are in inches.

    The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.

    Height Statistics (inches)   Boys   Girls
    Mean                         67     62
    Standard Deviation           2.9    2.2

    There are 500 measurements for each gender.

    Here are the datasets:

    • hs_heights.csv: contains a single column with heights for all boys and girls. There's no way to tell which of the values are for boys and which ones are for girls.

    • hs_heights_pair.csv: has two columns. The first column has boys' heights. The second column contains girls' heights.

    • hs_heights_flag.csv: has two columns. The first column has the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.
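
    The simulation described above can be sketched in a few lines of Python. This is not the author's generator (that lives in the repository linked below); it is an illustration that assumes the standard library's random.gauss, an arbitrary seed, and in-memory lists standing in for the three CSV layouts.

```python
import random

# Parameters taken from the table above (inches).
BOYS_MEAN, BOYS_SD = 67.0, 2.9
GIRLS_MEAN, GIRLS_SD = 62.0, 2.2


def simulate_heights(n=500, seed=0):
    """Draw n boys' and n girls' heights from two normal distributions."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    boys = [rng.gauss(BOYS_MEAN, BOYS_SD) for _ in range(n)]
    girls = [rng.gauss(GIRLS_MEAN, GIRLS_SD) for _ in range(n)]
    return boys, girls


boys, girls = simulate_heights()

# Layout of hs_heights.csv: a single column with no gender information.
combined = boys + girls

# Layout of hs_heights_pair.csv: one boy's and one girl's height per row.
pairs = list(zip(boys, girls))

# Layout of hs_heights_flag.csv: (is_girl, height) per row.
flagged = [(0, h) for h in boys] + [(1, h) for h in girls]

print(len(combined), len(pairs), len(flagged))
```

    Writing each list out with csv.writer would reproduce files of the same shape, though the individual values will differ from the published ones.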

    To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights

    Image by Gillian Callison from Pixabay

  9. LinkedIn Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 17, 2021
    Cite
    Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
    Explore at:
    .json, .csv, .xlsx (available download formats)
    Dataset updated
    Dec 17, 2021
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

    Dataset Features

    • Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
    • Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
    • Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

    Customizable Subsets for Specific Needs

    Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

    Popular Use Cases

    • Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
    • Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
    • Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
    • Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
    • AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

    Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.

  10. Data from: Active Sonar Data Set

    • data.mendeley.com
    • search.datacite.org
    Updated Oct 9, 2017
    Cite
    Mohammad Khishe (2017). Active Sonar Data Set [Dataset]. http://doi.org/10.17632/fyxjjwzphf.1
    Dataset updated
    Oct 9, 2017
    Authors
    Mohammad Khishe
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this data set, 6 objects (2 targets and 4 non-targets) lie on the sandy sea bottom. In this experiment, the transmitted signal is a Wide-Band Linear Frequency Modulated (WLFM) pulse covering the frequency range 5-110 kHz. The objects on the bottom are rotated through 180 degrees with 1 degree accuracy by an electric motor. Backscattered echoes are accumulated at ranges of up to 10 meters from the target. A good dataset plays a key role in sonar target classification. Given the large volume of raw data obtained in the previous stage, heavy computation is to be expected. To reduce the computational burden of feature extraction and classification, it is essential to detect the targets within the total received data. To do this, the intensity of the received signal is used. Because the region is shallow, multi-path propagation, secondary reflections, and reverberation have to be taken into account. The researcher attempts to eliminate these artifacts with a matched filter, after the detection stage and before feature extraction.
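
    To illustrate the matched-filter idea mentioned above, here is a minimal pure-Python sketch that correlates a noisy received signal with a replica of a linear FM pulse and looks for the correlation peak. Only the 5-110 kHz band comes from the description; the sampling rate, pulse duration, echo delay and noise level are assumptions, and the dataset's actual processing chain is not documented here.

```python
import math
import random

# All numeric settings below except the 5-110 kHz band are assumptions.
FS = 400_000.0                 # sampling rate in Hz (assumed)
PULSE_S = 0.001                # pulse duration in seconds (assumed)
F0, F1 = 5_000.0, 110_000.0    # sweep band from the description


def lfm_pulse(fs=FS, duration=PULSE_S, f0=F0, f1=F1):
    """Real-valued linear FM pulse sweeping from f0 to f1."""
    n = int(round(fs * duration))
    rate = (f1 - f0) / duration  # sweep rate in Hz per second
    return [
        math.sin(2.0 * math.pi * (f0 * t + 0.5 * rate * t * t))
        for t in (i / fs for i in range(n))
    ]


def matched_filter(received, replica):
    """Correlate the received signal with the transmitted replica."""
    m = len(replica)
    return [
        sum(received[k + i] * replica[i] for i in range(m))
        for k in range(len(received) - m + 1)
    ]


rng = random.Random(0)
pulse = lfm_pulse()
delay = 700  # echo position in samples (assumed)
received = [rng.gauss(0.0, 0.2) for _ in range(2000)]  # background noise
for i, s in enumerate(pulse):
    received[delay + i] += s  # bury one echo in the noise

output = matched_filter(received, pulse)
peak = max(range(len(output)), key=lambda k: abs(output[k]))
print(peak)
```

    The index of the largest correlation magnitude estimates the echo's arrival sample. A practical implementation would typically use FFT-based correlation for speed.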

  11. Instagram Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Apr 26, 2022
    + more versions
    Cite
    Bright Data (2022). Instagram Dataset [Dataset]. https://brightdata.com/products/datasets/instagram
    Explore at:
    .json, .csv, .xlsx (available download formats)
    Dataset updated
    Apr 26, 2022
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our Instagram dataset (public data) to extract business and non-business information from complete public profiles and filter by hashtags, followers, account type, or engagement score. Depending on your needs, you may purchase the entire dataset or a customized subset. Popular use cases include sentiment analysis, brand monitoring, influencer marketing, and more. The dataset includes all major data points: # of followers, verified status, account type (business / non-business), links, posts, comments, location, engagement score, hashtags, and much more.

  12. Cline Center Coup d’État Project Dataset

    • databank.illinois.edu
    Updated May 11, 2025
    + more versions
    Cite
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto (2025). Cline Center Coup d’État Project Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9651987_V7
    Dataset updated
    May 11, 2025
    Authors
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coups d'État are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d’État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.

    Version 2.1.3 adds 19 additional coup events to the data set, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 to a conspiracy.

    Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022.

    Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021. Version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup.

    Version 2.1.0 added 36 cases to the data set and removed two cases from the v2.0.0 data. This update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event.

    Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:

    • Reconciling missing event data
    • Removing events with irreconcilable event dates
    • Removing events with insufficient sourcing (each event needs at least two sources)
    • Removing events that were inaccurately coded as coup events
    • Removing variables that fell below the threshold of inter-coder reliability required by the project
    • Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries
    • Extending the period covered from 1945-2005 to 1945-2019
    • Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)

    Items in this Dataset

    1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf - This 15-page document describes the Cline Center Coup d’État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d’état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024
    2. Coup Data v2.1.3.csv - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d’État Project. It contains 29 variables and 1000 observations. Revised February 2024
    3. Source Document v2.1.3.pdf - This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024
    4. README.md - This file contains useful information for the user about the dataset. It is a text file written in markdown language. Revised February 2024

    Citation Guidelines

    1. To cite the codebook (or any other documentation associated with the Cline Center Coup d’État Project Dataset) please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. “Cline Center Coup d’État Project Dataset Codebook”. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
    2. To cite data from the Cline Center Coup d’État Project Dataset please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7

  13. Project Management

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated May 2, 2025
    Cite
    Office of Project Management (2025). Project Management [Dataset]. https://catalog.data.gov/dataset/project-management
    Dataset updated
    May 2, 2025
    Dataset provided by
    Office of Project Management
    Description

    The Office of Project Management is the Department of Energy’s Enterprise Project Management Organization (EPMO). It provides leadership and assistance in developing and implementing DOE-wide policies, procedures, programs, and management systems pertaining to project management, and independently monitors, assesses, and reports on project execution performance. The office validates project performance baselines (scope, cost and schedule) of the Department’s largest construction and environmental clean-up projects prior to budget request to Congress, an active project portfolio totaling over $30 billion. The office also serves as Executive Secretariat for the Department’s Energy Systems Acquisition Advisory Board (ESAAB) and the Project Management Risk Committee (PMRC). In these capacities, the Director is accountable to the Deputy Secretary.

  14. History of work (all graph datasets)

    • druid.datalegend.net
    • iisg.amsterdam
    application/n-quads +5
    Updated Apr 18, 2025
    Cite
    History of Work (2025). History of work (all graph datasets) [Dataset]. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest
    Explore at:
    application/n-quads, application/n-triples, application/trig, ttl, jsonld, application/sparql-results+json (available download formats)
    Dataset updated
    Apr 18, 2025
    Dataset authored and provided by
    History of Work
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    History of Work

    Here you find the History of Work resources as Linked Open Data. It enables you to look up HISCO and HISCAM scores for a very large number of occupational titles in numerous languages.

    Data can be queried (obtained) via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question: "Is there a list of female occupations?"

    NEW version - CHANGE notes

    This version is dated Apr 2025 and is not backwards compatible with the previous version (Feb 2021). The major changes are:

    • considerable simplification of graph representation (from 81 to 12);
    • use of sdo (https://schema.org/) rather than schema (http://schema.org);
    • replacement of prov:wasDerivedFrom with sdo:isPartOf to link occupational titles to originating datasets;
    • etl files (used for conversion to Linked Data) now publicly available via https://github.com/rlzijdeman/rdf-hisco;
    • update of issues with language tags;
    • specification of language tags for English (e.g. @en-gb, instead of @en);
    • new preferred API: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/sparql (old API will be deprecated at some point: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/services/historyOfWork-all-latest/sparql).
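
    As a rough sketch of how the new preferred API might be queried from Python, the snippet below builds a standard SPARQL 1.1 Protocol GET request using only the standard library. The endpoint URL is the one given in the change notes; that it accepts GET requests and returns application/sparql-results+json is an assumption, and the query is deliberately generic because the graph's vocabulary is not described here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the change notes above.
ENDPOINT = (
    "https://api.druid.datalegend.net/datasets/"
    "HistoryOfWork/historyOfWork-all-latest/sparql"
)

# Generic query that makes no assumption about the graph's vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the result bindings (needs network access)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


request = build_request(QUERY)
print(request.full_url)
# To execute against the live endpoint: rows = run_query(QUERY)
```

    Building the request separately from sending it makes it easy to inspect the URL first, and to replace the generic query with one written against the actual HISCO model once its predicates are known.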

    There are bound to be some issues. Please report them here.

    Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO. (Image "hisco-basic": https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521)

    Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables. (Image "hisco-aux": https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e)

  15. Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for...

    • data.4tu.nl
    Updated Jun 7, 2022
    Cite
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung (2022). Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild [Dataset]. http://doi.org/10.4121/20017748.v2
    Dataset updated
    Jun 7, 2022
    Dataset provided by
    4TU.ResearchData
    Authors
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung
    License

    https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf

    Description

    This file contains raw data for cameras and wearables of the ConfLab dataset.


    ./cameras contains the overhead video recordings for 9 cameras (cam2-10) in MP4 files. These cameras cover the whole interaction floor, with camera 2 capturing the bottom of the scene layout, and camera 10 capturing the top of the scene layout. Note that cam5 ran out of battery before the other cameras and thus its recordings are cut short. However, cam4 and cam6 contain significant overlap with cam5, to reconstruct any information needed.

    Note that the annotations are made and provided in 2 minute segments. The annotated portions of the video include the last 3min38sec of x2xxx.MP4 video files, and the first 12 min of x3xxx.MP4 files for cameras (2,4,6,8,10), with "x" being the placeholder character in the mp4 file names. If one wishes to separate the video into 2 min segments as we did, the "video-splitting.sh" script is provided.


    ./camera-calibration contains the camera intrinsic files obtained from https://github.com/idiap/multicamera-calibration. Camera extrinsic parameters can be calculated using the existing intrinsic parameters and the instructions in the multicamera-calibration repo. The coordinates in the image are provided by the crosses marked on the floor, which are visible in the video recordings. The crosses are 1m apart (=100cm).


    ./wearables

    subdirectory includes the IMU, proximity and audio data from each participant at the ConfLab event (48 in total). In the directory numbered by participant ID, the following data are included:

    1. Raw audio file
    2. Proximity (Bluetooth) pings (RSSI) file (raw and csv) and a visualization
    3. Tri-axial accelerometer data (raw and csv) and a visualization
    4. Tri-axial gyroscope data (raw and csv) and a visualization
    5. Tri-axial magnetometer data (raw and csv) and a visualization
    6. Game rotation vector (raw and csv), recorded in quaternions


    All files are timestamped.

    The sampling frequencies are:

    - audio: 1250 Hz
    - rest: around 50 Hz. However, the sample rate is not fixed, so the timestamps should be used instead.
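
    Because the sensor streams are only nominally 50 Hz, one common way to work with them is to interpolate each stream onto a uniform time grid using the recorded timestamps. The following is a minimal sketch of that idea, not part of the dataset tooling. It assumes timestamps have already been converted to seconds and are strictly increasing; the function name and the target rate are illustrative choices.

```python
from bisect import bisect_right


def resample_uniform(timestamps, values, rate_hz):
    """Linearly interpolate irregularly sampled values onto a uniform grid.

    timestamps: strictly increasing sample times in seconds (assumed unit)
    values:     one scalar reading per timestamp
    rate_hz:    desired output rate, e.g. 50 (an assumption of this sketch)
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, value) pairs")
    step = 1.0 / rate_hz
    start, end = timestamps[0], timestamps[-1]
    n_out = int((end - start) / step + 1e-9) + 1
    grid, out = [], []
    for k in range(n_out):
        t = start + k * step
        # Index of the last recorded sample at or before t, clamped so that
        # a following sample always exists for interpolation.
        i = min(bisect_right(timestamps, t) - 1, len(timestamps) - 2)
        t0, t1 = timestamps[i], timestamps[i + 1]
        w = (t - t0) / (t1 - t0)
        grid.append(t)
        out.append(values[i] + w * (values[i + 1] - values[i]))
    return grid, out
```

    Linear interpolation suits scalar channels such as a single accelerometer axis. Quaternion rotation data would need a rotation-aware interpolation instead.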


    For rotation, the game rotation vector's output frequency is limited by the actual sampling frequency of the magnetometer. For more information, please refer to https://invensense.tdk.com/wp-content/uploads/2016/06/DS-000189-ICM-20948-v1.3.pdf


    Audio files in this folder are in raw binary form. The following can be used to convert them to WAV files (1250 Hz):

    ffmpeg -f s16le -ar 1250 -ac 1 -i /path/to/audio/file /path/to/output.wav
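
    Where ffmpeg is not available, the same conversion can be sketched with Python's standard `wave` module. This assumes the raw files are headerless signed 16-bit little-endian mono PCM at 1250 Hz, as the ffmpeg flags above indicate. The function name and paths are illustrative.

```python
import wave


def raw_pcm_to_wav(raw_path, wav_path, sample_rate=1250, channels=1):
    """Wrap headerless signed 16-bit little-endian PCM in a WAV container."""
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 2 bytes per sample, matching ffmpeg's s16le
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    # Number of audio frames written (2 bytes per sample per channel).
    return len(pcm) // (2 * channels)
```

    No sample data is modified here. WAV stores 16-bit PCM in the same little-endian layout, so only a header is added.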


    Synchronization of cameras and wearables data

    Raw videos contain timecode information which matches the timestamps of the data in the "wearables" folder. The starting timecode of a video can be read as:

    ffprobe -hide_banner -show_streams -i /path/to/video
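
    The ffprobe output can then be scanned for the timecode. The sketch below assumes the timecode appears as a `TAG:timecode=HH:MM:SS:FF` line (ffprobe's default output style for stream tags) and that it is a non-drop-frame timecode. The frame rate must be supplied by the caller, and how the resulting value maps onto the wearable timestamps is not specified here. All function names are illustrative.

```python
import re

# Matches e.g. "TAG:timecode=14:23:10:15" on a line of its own.
TIMECODE_RE = re.compile(
    r"^TAG:timecode=(\d+):(\d{2}):(\d{2})[:;](\d{2})\s*$", re.MULTILINE
)


def first_timecode(ffprobe_text):
    """Return (hours, minutes, seconds, frames) of the first timecode tag, or None."""
    match = TIMECODE_RE.search(ffprobe_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def timecode_to_seconds(timecode, fps):
    """Convert a non-drop-frame timecode tuple to seconds since 00:00:00:00."""
    hours, minutes, seconds, frames = timecode
    return hours * 3600 + minutes * 60 + seconds + frames / fps
```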


    ./audio

    ./sync: contains WAV files for each subject
    ./sync_files: auxiliary csv files used to sync the audio. Can be used to improve the synchronization.

    The code used for syncing the audio can be found here: https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/audio

  16. Industrial Dataset

    • kaggle.com
    Updated May 8, 2023
    Cite
    Be Schue (2023). Industrial Dataset [Dataset]. https://www.kaggle.com/datasets/beschue/industrial-classification-data-set
    Dataset updated
    May 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Be Schue
    Description

    The dataset includes 10 object categories from the MVTEC INDUSTRIAL 3D OBJECT DETECTION DATASET as input CAD objects. The selected objects include a diverse range of industrial products:

    S.No  Object Class
    1     adapter plate triangular
    2     bracket big
    3     clamp small
    4     engine part cooler round
    5     engine part cooler square
    6     injection pump
    7     screw
    8     star
    9     tee connector
    10    thread

    The dataset contains a total of 100,000 RGB images (10,000 per object category), divided into three sets: 70,000 for training, 20,000 for testing, and 10,000 for validation. Each image has a resolution of 224 x 224 and is in JPEG format.

    To ensure the suitability of our dataset for various computer vision tasks, we included not only the class labels but also generated bounding boxes and semantic masks for each image, which are stored in COCO annotation format. Each image contains one instance of the ten selected objects.
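
    As an illustration of how such annotations are typically consumed, the sketch below indexes a COCO-style dictionary by image file name. It relies only on the standard COCO keys (`images`, `annotations`, `categories`, and `bbox` given as `[x, y, width, height]`). The function name is illustrative, and the layout of this dataset's annotation files is not described here, so treat the loading step as an assumption.

```python
from collections import defaultdict


def index_coco(coco):
    """Group COCO annotations by image file name with corner-format boxes."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    labels = {cat["id"]: cat["name"] for cat in coco["categories"]}
    by_file = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        by_file[file_names[ann["image_id"]]].append(
            {
                "label": labels[ann["category_id"]],
                "box_xyxy": (x, y, x + w, y + h),
            }
        )
    return dict(by_file)
```

    In practice the dictionary would come from `json.load` on an annotation file. Since each image contains one object instance, each file name should map to a single entry.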

    Throughout the 10,000 images for each class, we randomly varied the position of the object in x-y-z direction and the object’s rotation to provide a diverse range of images. Additionally, we changed the object’s surface to a smooth metallic texture, imitating real industrial components. Lastly, we varied the lighting conditions within each image, including the position of the light sources, their energy, and emission strength.

    Find out more about our Data Generation Tool:

    Schuerrle, B., Sankarappan, V., & Morozov, A. (2023). SynthiCAD: Generation of Industrial Image Data Sets for Resilience Evaluation of Safety-Critical Classifiers. In Proceeding of the 33rd European Safety and Reliability Conference. 33rd European Safety and Reliability Conference. Research Publishing Services. https://doi.org/10.3850/978-981-18-8071-1_p400-cd

  17. Dataset relating a study on Geospatial Open Data usage and metadata quality

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jun 19, 2023
    Cite
    Alfonso Quarati; Monica De Martino (2023). Dataset relating a study on Geospatial Open Data usage and metadata quality [Dataset]. http://doi.org/10.5281/zenodo.4280594
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alfonso Quarati; Monica De Martino
    Description

    Open Government Data (OGD) portals host thousands of geo-referenced datasets containing spatial information, which makes them of great interest for any analysis or process relating to the territory. For this potential to be realized, users must be able to access these datasets and reuse them. One factor often considered to hinder the full dissemination of OGD data is the quality of their metadata. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, this work first provides an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between these two variables was measured. The results showed a significant underutilization of geospatial datasets and a generally poor quality of their metadata. In addition, only a weak correlation was found between use and metadata quality, not strong enough to assert with certainty that the latter is a determining factor of the former.

    The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).

    Data collection occurred in the period: 2019-12-19 -- 2019-12-23.

    The header for each CSV file is:

    [ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]

    where, for each row (a portal's dataset), the fields are defined as follows:

    • portalid: portal identifier
    • id: dataset identifier
    • downloaddate: date of data collection
    • metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema
    • overallq: overall quality values computed by applying the methodology presented in [1]
    • qvalues: json object containing the quality values computed for the 17 metrics presented in [1]
    • assessdate: date of quality assessment
    • dviews: number of total views for the dataset
    • downloads: number of total downloads for the dataset (made available only by the Colombia, HDX, and NASA portals)
    • engine: identifier of the supporting portal platform: 1(CKAN), 2 (Socrata)
    • admindomain: 1 (national), 2 (international)

    [1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
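
    A minimal sketch of reading one of these CSV files with the Python standard library follows. It assumes that `qvalues` holds a quoted JSON string, that `overallq` and `dviews` parse as numbers, and that `downloads` is empty for portals that do not publish it. None of these encodings are confirmed by the description above, and the function name is illustrative.

```python
import csv
import json


def read_portal_rows(lines):
    """Parse rows of a portal CSV into dictionaries with typed fields.

    lines: any iterable of text lines, e.g. an open file object.
    """
    rows = []
    for row in csv.DictReader(lines):
        rows.append(
            {
                "portalid": row["portalid"],
                "id": row["id"],
                "overallq": float(row["overallq"]),
                # qvalues is described as a JSON object of per-metric scores.
                "qvalues": json.loads(row["qvalues"]),
                "dviews": int(row["dviews"]),
                # Assumed to be blank where the portal reports no downloads.
                "downloads": int(row["downloads"]) if row["downloads"] else None,
            }
        )
    return rows
```

    The `metadata` column can be very large, so for real files it may be necessary to raise the parser's limit with `csv.field_size_limit` before reading.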

  18. Accessibility Destination Datasets

    • data.europa.eu
    Updated Aug 18, 2011
    Cite
    Department for Transport (2011). Accessibility Destination Datasets [Dataset]. https://data.europa.eu/data/datasets/accessibility-destination-datasets?locale=en
    Dataset updated
    Aug 18, 2011
    Dataset authored and provided by
    Department for Transport
    License

    http://reference.data.gov.uk/id/open-government-licence

    Description

    Excel datasets containing raw destination data for calculating Accessibility statistics. This gives the locations of the different services used within these calculations: Primary schools, Secondary Schools, Further Education, Hospitals, GPs, Town Centres, Employment Centres.

    The Food Stores data, and the 2010 GP and Hospitals data used in the accessibility statistics calculations, come from a commercial dataset and cannot be made available for reuse.

  19. ImageNet-Sketch Dataset

    • paperswithcode.com
    Updated Oct 23, 2022
    Cite
    Haohan Wang; Songwei Ge; Eric P. Xing; Zachary C. Lipton (2022). ImageNet-Sketch Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-sketch
    Dataset updated
    Oct 23, 2022
    Authors
    Haohan Wang; Songwei Ge; Eric P. Xing; Zachary C. Lipton
    Description

    The ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set is constructed with Google Image queries "sketch of __", where __ is the standard class name. Searches are restricted to the "black and white" color scheme. 100 images are initially queried for every class, and the pulled images are cleaned by deleting irrelevant images and images that belong to similar but different classes. For some classes, fewer than 50 images remain after manual cleaning, so the data set is augmented by flipping and rotating the images.

  20. Amazon review data 2018

    • cseweb.ucsd.edu
    • nijianmo.github.io
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
Cite
Microsoft (2024). FStarDataSet-V2 [Dataset]. https://huggingface.co/datasets/microsoft/FStarDataSet-V2

FStarDataSet-V2

PoPAI-FStarDataSet-V2

microsoft/FStarDataSet-V2

Dataset updated
Sep 4, 2024
Dataset authored and provided by
Microsoft (http://microsoft.com/)
License

https://choosealicense.com/licenses/cdla-permissive-2.0/

Description

This dataset is the Version 2.0 of microsoft/FStarDataSet.

  Primary-Objective

This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of an AI model is to synthesize the implementation (see below for details about the usage of this dataset, including the input and output).

  Data Format

Each of the examples in this dataset are organized as dictionaries… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.
