100+ datasets found

FStarDataSet-V2
huggingface.co
Updated Sep 4, 2024
Cite
Microsoft (2024). FStarDataSet-V2 [Dataset]. https://huggingface.co/datasets/microsoft/FStarDataSet-V2
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 4, 2024
Dataset authored and provided by
Microsoft (http://microsoft.com/)
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
This dataset is the Version 2.0 of microsoft/FStarDataSet.

Primary-Objective

This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of an AI model is to synthesize the implementation (see below for details about the usage of this dataset, including the input and output).

Data Format

Each example in this dataset is organized as a dictionary… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.
SECs Compiled Financial Statements & Notes Dataset
kaggle.com
Updated Jul 31, 2024
Cite
Deny Tran (2024). SECs Compiled Financial Statements & Notes Dataset [Dataset]. https://www.kaggle.com/datasets/denytran/im-a-dataset
Explore at:
Croissant
Dataset updated
Jul 31, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Deny Tran
License
https://www.usa.gov/government-works/
Description
This dataset is from the SEC's Financial Statements and Notes Data Set.
It was a personal project to see if I could make the queries efficient.
It's just been collecting dust ever since, maybe someone will make good use of it.
Data is up to about early-2024.
It doesn't differ from the source, other than it's compiled - so maybe you can try it out, then compile your own (with the link below).
Dataset was created using SEC Files and SQL Server on Docker.
For details on the SQL Server database this came from, see the "dataset-previous-life-info" folder, which will contain:
- Row Counts
- Primary/Foreign Keys
- SQL Statements to recreate database tables
- Example queries on how to join the data tables
- A pretty picture of the table associations
Source: https://www.sec.gov/data-research/financial-statement-notes-data-sets

Happy coding!
OpenFEMA Data Set Fields
catalog.data.gov
datasets.ai
Updated Jun 7, 2025
Cite
FEMA/Mission Support/Off of Chf Information Officer (2025). OpenFEMA Data Set Fields [Dataset]. https://catalog.data.gov/dataset/openfema-data-set-fields
Dataset updated
Jun 7, 2025
Dataset provided by
FEMA/Mission Support/Off of Chf Information Officer
Description
Metadata for the OpenFEMA API data set fields. It contains descriptions, data types, and other attributes for each field.

If you have media inquiries about this dataset, please email the FEMA News Desk at FEMA-News-Desk@dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open Government program, please contact the OpenFEMA team via email at OpenFEMA@fema.dhs.gov.
Dataset - Understanding the software and data used in the social sciences
zenodo.org
eprints.soton.ac.uk
pdf, zip
Updated Jul 12, 2024
Cite
Selina Aragon; Mario Antonioletti; Johanna Walker; Neil Chue Hong (2024). Dataset - Understanding the software and data used in the social sciences [Dataset]. http://doi.org/10.5281/zenodo.7785711
Available download formats: zip, pdf
Unique identifier
https://doi.org/10.5281/zenodo.7785711
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Selina Aragon; Mario Antonioletti; Johanna Walker; Neil Chue Hong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a repository for a UKRI Economic and Social Research Council (ESRC) funded project to understand the software used to analyse social sciences data.
Any software produced has been made available under a BSD 2-Clause license and any data and other non-software derivative is made available under a CC-BY 4.0 International License. Note that the software that analysed the survey is provided for illustrative purposes - it will not work on the decoupled anonymised data set.
Exceptions to this are:
Data from the UKRI ESRC is mostly made available under a CC BY-NC-SA 4.0 Licence.
Data from Gateway to Research is made available under an Open Government Licence (Version 3.0).
Contents
Survey data & analysis: esrc_data-survey-analysis-data.zip
Other data: esrc_data-other-data.zip
Transcripts: esrc_data-transcripts.zip
Data Management Plan: esrc_data-dmp.zip
Survey data & analysis
The survey ran from 3rd February 2022 to 6th March 2023, during which 168 responses were received. Of these responses, three were removed because they were supplied by people from outside the UK without a clear indication of involvement with the UK or associated infrastructure. A fourth response was removed because it came from the same person as another response, which leaves us with 164 responses in the data.
The survey responses, Question (Q) Q1-Q16, have been decoupled from the demographic data, Q17-Q23. Questions Q24-Q28 are for follow-up and have been removed from the data. The institutions (Q17) and funding sources (Q18) have been provided in a separate file as this could be used to identify respondents. Q17, Q18 and Q19-Q23 have all been independently shuffled.
The data has been made available as Comma Separated Values (CSV) with the question number as the header of each column and the encoded responses in the column below. To see what the question and the responses correspond to you will have to consult the survey-results-key.csv which decodes the question and responses accordingly.
A pdf copy of the survey questions is available on GitHub.
The survey data has been decoupled into:
survey-results-key.csv - maps a question number and the responses to the actual question values.
q1-16-survey-results.csv - the non-demographic component of the survey responses (Q1-Q16).
q19-23-demographics.csv - the demographic part of the survey (Q19-Q21, Q23).
q17-institutions.csv - the institution/location of the respondent (Q17).
q18-funding.csv - funding sources within the last 5 years (Q18).
Please note the code that has been used to do the analysis will not run with the decoupled survey data.
Other data files included
CleanedLocations.csv - normalised version of the institutions that the survey respondents volunteered.
DTPs.csv - information on the UKRI Doctoral Training Partnerships (DTPs) scraped from the UKRI DTP contacts web page in October 2021.
projectsearch-1646403729132.csv.gz - data snapshot from the UKRI Gateway to Research released on the 24th February 2022 made available under an Open Government Licence.
locations.csv - latitude and longitude for the institutions in the cleaned locations.
subjects.csv - research classifications for the ESRC projects for the 24th February data snapshot.
topics.csv - topic classification for the ESRC projects for the 24th February data snapshot.
Interview transcripts
The interview transcripts have been anonymised and converted to markdown so that they are easier to process in general. List of interview transcripts:
1269794877.md
1578450175.md
1792505583.md
2964377624.md
3270614512.md
40983347262.md
4288358080.md
4561769548.md
4938919540.md
5037840428.md
5766299900.md
5996360861.md
6422621713.md
6776362537.md
7183719943.md
7227322280.md
7336263536.md
75909371872.md
7869268779.md
8031500357.md
9253010492.md
Data Management Plan
The study's Data Management Plan is provided in PDF format and shows the different data sets used throughout the duration of the study and where they have been deposited, as well as how long the SSI will keep these records.
KaggleDBQA Dataset
paperswithcode.com
Updated Jan 20, 2025
Cite
Chia-Hsuan Lee; Oleksandr Polozov; Matthew Richardson (2025). KaggleDBQA Dataset [Dataset]. https://paperswithcode.com/dataset/kaggledbqa
Dataset updated
Jan 20, 2025
Authors
Chia-Hsuan Lee; Oleksandr Polozov; Matthew Richardson
Description
KaggleDBQA is a challenging cross-domain and complex evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions.

It expands upon contemporary cross-domain text-to-SQL datasets in three key aspects: (1) Its databases are pulled from real-world data sources and not normalized. (2) Its questions are authored in environments that mimic natural question answering. (3) It also provides database documentation that contains rich in-domain knowledge.
Data from: Data Sets for Evaluation of Building Fault Detection and...
osti.gov
data.openei.org
+1 more
Updated Feb 26, 2019
Cite
Lin, Guanjing; Mitchell, Robin (2019). Data Sets for Evaluation of Building Fault Detection and Diagnostics Algorithms [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1824861-data-sets-evaluation-building-fault-detection-diagnostics-algorithms
Dataset updated
Feb 26, 2019
Dataset provided by
United States Department of Energy (http://energy.gov/)
49.2637,-66.5318|24.5873,-66.5318|24.5873,-125.4514|49.2637,-125.4514|49.2637,-66.5318
DOE Open Energy Data Initiative (OEDI); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Authors
Lin, Guanjing; Mitchell, Robin
Description
This documentation and dataset can be used to test the performance of automated fault detection and diagnostics algorithms for buildings. The dataset was created by LBNL, PNNL, NREL, ORNL and ASHRAE RP-1312 (Drexel University). It includes data for air-handling units and rooftop units simulated with PNNL's large office building model.
AI-Generated-vs-Real-Images-Datasets
huggingface.co
Updated Mar 27, 2025
Cite
Hem Bahadur Gurung (2025). AI-Generated-vs-Real-Images-Datasets [Dataset]. https://huggingface.co/datasets/Hemg/AI-Generated-vs-Real-Images-Datasets
Explore at:
Croissant
Dataset updated
Mar 27, 2025
Authors
Hem Bahadur Gurung
Description
Dataset Card for "AI-Generated-vs-Real-Images-Datasets"

More Information needed
High School Heights Dataset
kaggle.com
Updated Aug 11, 2022
Cite
Yashmeet Singh (2022). High School Heights Dataset [Dataset]. https://www.kaggle.com/datasets/yashmeetsingh/high-school-heights-dataset
Explore at:
Croissant
Dataset updated
Aug 11, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Yashmeet Singh
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
High School Heights Dataset

You will find three datasets containing heights of the high school students.

All heights are in inches.

The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.

Height Statistics (inches)   Boys   Girls
Mean                         67     62
Standard Deviation           2.9    2.2

There are 500 measurements for each gender.

Here are the datasets:

hs_heights.csv: contains a single column with heights for all boys and girls. There's no way to tell which of the values are for boys and which ones are for girls.

hs_heights_pair.csv: has two columns. The first column has boys' heights. The second column contains girls' heights.

hs_heights_flag.csv: has two columns. The first column has the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.
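
As a rough illustration of how data like this could be simulated, the sketch below draws 500 heights per group from the normal distributions in the table above and arranges them in the three layouts described. The function names, the seeds, and the shuffling are assumptions made for this example; the author's actual generation code is in the GitHub repository linked below and may differ.

```python
import random


def simulate_heights(n_per_group=500, seed=0):
    """Draw simulated heights (inches) from the normal distributions in the table."""
    rng = random.Random(seed)
    boys = [rng.gauss(67.0, 2.9) for _ in range(n_per_group)]
    girls = [rng.gauss(62.0, 2.2) for _ in range(n_per_group)]
    return boys, girls


def build_layouts(boys, girls, seed=0):
    """Arrange the heights in the three layouts described above (hypothetical helper)."""
    rng = random.Random(seed)
    # Layout 1: a single column of heights with no gender information.
    combined = boys + girls
    rng.shuffle(combined)
    # Layout 2: two columns, boys' heights first and girls' heights second.
    pairs = list(zip(boys, girls))
    # Layout 3: an is_girl flag followed by the height.
    flagged = [(0, h) for h in boys] + [(1, h) for h in girls]
    rng.shuffle(flagged)
    return combined, pairs, flagged


boys, girls = simulate_heights()
combined, pairs, flagged = build_layouts(boys, girls)
print(len(combined), len(pairs), len(flagged))
```

Whether the published files are shuffled or rounded is not stated in the description, so treat those details as placeholders.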

To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights

Image by Gillian Callison from Pixabay
LinkedIn Datasets
brightdata.com
.json, .csv, .xlsx
Updated Dec 17, 2021
Cite
Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
Available download formats: .json, .csv, .xlsx
Dataset updated
Dec 17, 2021
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

Dataset Features

Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs

Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

Popular Use Cases

Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
Data from: Active Sonar Data Set
data.mendeley.com
search.datacite.org
Updated Oct 9, 2017
Cite
Mohammad Khishe (2017). Active Sonar Data Set [Dataset]. http://doi.org/10.17632/fyxjjwzphf.1
Unique identifier
https://doi.org/10.17632/fyxjjwzphf.1
Dataset updated
Oct 9, 2017
Authors
Mohammad Khishe
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this data set, 6 objects, comprising 2 targets and 4 non-targets, lie on the sandy sea bottom. In this experiment, the transmitted signal is a Wide-Band Linear Frequency Modulated Pulse (WLFM) covering the frequency range 5-110 kHz. The targets lying on the bottom are rotated through 180 degrees with 1 degree accuracy by an electromotor. Backscattered echoes are accumulated from the target out to 10 meters. A good dataset plays a key role in sonar target classification. Given the massive raw data obtained in the previous stage, heavy computation is to be expected. To reduce the computational burden of classification and feature extraction, it is essential to detect the targets within the total received data. To implement this, the intensity of the received signal is used. Multi-path propagation, secondary reflections, and reverberation must be considered because the region is shallow. The researcher attempts to eliminate these artifacts after the detection stage and before feature extraction by using a matched filter.
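The matched-filter step mentioned above can be sketched as a cross-correlation between the received signal and the known transmitted pulse. The example below is a minimal illustration under stated assumptions, not the author's processing chain: the pulse length, the normalized frequencies (cycles per sample rather than the 5-110 kHz band), the noise level, and all function names are hypothetical.

```python
import math
import random


def lfm_pulse(n_samples, f_start, f_end):
    """Linear frequency modulated pulse; frequencies are in cycles per sample."""
    rate = (f_end - f_start) / n_samples
    return [math.cos(2.0 * math.pi * (f_start * t + 0.5 * rate * t * t))
            for t in range(n_samples)]


def matched_filter(received, template):
    """Cross-correlate the received signal with the known transmitted pulse."""
    n, m = len(received), len(template)
    return [sum(received[lag + k] * template[k] for k in range(m))
            for lag in range(n - m + 1)]


# Synthetic received signal: background noise plus one echo at a known delay.
rng = random.Random(1)
pulse = lfm_pulse(64, 0.05, 0.25)
true_delay = 100
received = [rng.gauss(0.0, 0.1) for _ in range(256)]
for k, value in enumerate(pulse):
    received[true_delay + k] += value

# The filter output peaks where the echo best aligns with the template.
output = matched_filter(received, pulse)
estimated_delay = max(range(len(output)), key=lambda i: output[i])
print(estimated_delay)
```

The peak of the filter output marks the delay at which the echo matches the transmitted pulse, which is why the technique helps separate target echoes from reverberation and noise.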
Instagram Dataset
brightdata.com
.json, .csv, .xlsx
Updated Apr 26, 2022
+ more versions
Cite
Bright Data (2022). Instagram Dataset [Dataset]. https://brightdata.com/products/datasets/instagram
Available download formats: .json, .csv, .xlsx
Dataset updated
Apr 26, 2022
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Use our Instagram dataset (public data) to extract business and non-business information from complete public profiles and filter by hashtags, followers, account type, or engagement score. Depending on your needs, you may purchase the entire dataset or a customized subset. Popular use cases include sentiment analysis, brand monitoring, influencer marketing, and more. The dataset includes all major data points: # of followers, verified status, account type (business / non-business), links, posts, comments, location, engagement score, hashtags, and much more.
Cline Center Coup d’État Project Dataset
databank.illinois.edu
Updated May 11, 2025
+ more versions
Cite
Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto (2025). Cline Center Coup d’État Project Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9651987_V7
Unique identifier
https://doi.org/10.13012/B2IDB-9651987_V7
Dataset updated
May 11, 2025
Authors
Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Coups d'État are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d’État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.
Version 2.1.3 adds 19 additional coup events to the data set, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 to a conspiracy.
Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022.
Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021. Version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup.
Version 2.1.0 added 36 cases to the data set and removed two cases from the v2.0.0 data. This update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event.
Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:
• Reconciling missing event data
• Removing events with irreconcilable event dates
• Removing events with insufficient sourcing (each event needs at least two sources)
• Removing events that were inaccurately coded as coup events
• Removing variables that fell below the threshold of inter-coder reliability required by the project
• Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries
• Extending the period covered from 1945-2005 to 1945-2019
• Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)
Items in this Dataset
1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf - This 15-page document describes the Cline Center Coup d’État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d’état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024
2. Coup Data v2.1.3.csv - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d’État Project. It contains 29 variables and 1000 observations. Revised February 2024
3. Source Document v2.1.3.pdf - This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024
4. README.md - This file contains useful information for the user about the dataset. It is a text file written in markdown language. Revised February 2024
Citation Guidelines
1. To cite the codebook (or any other documentation associated with the Cline Center Coup d’État Project Dataset) please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. “Cline Center Coup d’État Project Dataset Codebook”. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
2. To cite data from the Cline Center Coup d’État Project Dataset please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
Project Management
catalog.data.gov
datasets.ai
+1 more
Updated May 2, 2025
Cite
Office of Project Management (2025). Project Management [Dataset]. https://catalog.data.gov/dataset/project-management
Dataset updated
May 2, 2025
Dataset provided by
Office of Project Management
Description
The Office of Project Management is the Department of Energy’s Enterprise Project Management Organization (EPMO). It provides leadership and assistance in developing and implementing DOE-wide policies, procedures, programs, and management systems pertaining to project management, and independently monitors, assesses, and reports on project execution performance. The office validates project performance baselines (scope, cost and schedule) of the Department’s largest construction and environmental clean-up projects prior to budget request to Congress, an active project portfolio totaling over $30 billion. The office also serves as Executive Secretariat for the Department’s Energy Systems Acquisition Advisory Board (ESAAB) and the Project Management Risk Committee (PMRC). In these capacities, the Director is accountable to the Deputy Secretary.
History of work (all graph datasets)
druid.datalegend.net
iisg.amsterdam
application/n-quads +5
Updated Apr 18, 2025
Cite
History of Work (2025). History of work (all graph datasets) [Dataset]. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest
Available download formats: application/n-quads, application/n-triples, application/trig, ttl, jsonld, application/sparql-results+json
Dataset updated
Apr 18, 2025
Dataset authored and provided by
History of Work
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
History of Work

Here you find the History of Work resources as Linked Open Data. It enables you to look up HISCO and HISCAM scores for an incredible number of occupational titles in numerous languages.

Data can be queried (obtained) via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question on Is there a list of female occupations?.

NEW version - CHANGE notes

This version is dated Apr 2025 and is not backwards compatible with the previous version (Feb 2021). The major changes are:
- incredible simplification of graph representation (from 81 to 12);
- use of sdo (https://schema.org/) rather than schema (http://schema.org);
- replacement of prov:wasDerivedFrom with sdo:isPartOf to link occupational titles to originating datasets;
- etl files (used for conversion to Linked Data) now publicly available via https://github.com/rlzijdeman/rdf-hisco;
- update of issues with language tags;
- specification of language tags for English (e.g. @en-gb, instead of @en);
- new preferred API: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/sparql (old API will be deprecated at some point: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/services/historyOfWork-all-latest/sparql).
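
For readers new to SPARQL endpoints, the sketch below shows one way a query might be sent to the preferred API named above using only the Python standard library. It assumes the endpoint accepts standard SPARQL Protocol GET requests with a `query` parameter and returns `application/sparql-results+json`; that behaviour is not confirmed by this description. The query itself is a generic placeholder that does not rely on any History of Work vocabulary, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Preferred API address as given in the change notes above.
ENDPOINT = ("https://api.druid.datalegend.net/datasets/HistoryOfWork/"
            "historyOfWork-all-latest/sparql")

# Generic placeholder query: fetch any five triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request (assumed to be supported by the endpoint)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint, query, timeout=30):
    """Send the query and return the result bindings (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Calling `run_query(ENDPOINT, QUERY)` would perform the actual request; it is left uncalled here so the sketch runs without network access.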

There are bound to be some issues. Please report them here.

Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO. (Image: https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521)

Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables. (Image: https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e)
Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for...
data.4tu.nl
Updated Jun 7, 2022
Cite
Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung (2022). Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild [Dataset]. http://doi.org/10.4121/20017748.v2
Unique identifier
https://doi.org/10.4121/20017748.v2
Dataset updated
Jun 7, 2022
Dataset provided by
4TU.ResearchData
Authors
Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung
License
https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf
Description
This file contains raw data for cameras and wearables of the ConfLab dataset.

./cameras

contains the overhead video recordings for 9 cameras (cam2-10) in MP4 files. These cameras cover the whole interaction floor, with camera 2 capturing the bottom of the scene layout, and camera 10 capturing the top of the scene layout. Note that cam5 ran out of battery before the other cameras and thus its recordings are cut short. However, cam4 and cam6 overlap significantly with cam5, so any information needed can be reconstructed.

Note that the annotations are made and provided in 2 minute segments. The annotated portions of the video include the last 3min38sec of x2xxx.MP4 video files, and the first 12 min of x3xxx.MP4 files for cameras (2,4,6,8,10), with "x" being the placeholder character in the mp4 file names. If one wishes to separate the video into 2 min segments as we did, the "video-splitting.sh" script is provided.

./camera-calibration contains the camera intrinsic files obtained from https://github.com/idiap/multicamera-calibration. Camera extrinsic parameters can be calculated using the existing intrinsic parameters and the instructions in the multicamera-calibration repo. The coordinates in the image are provided by the crosses marked on the floor, which are visible in the video recordings. The crosses are 1m apart (=100cm).

./wearables

subdirectory includes the IMU, proximity and audio data from each participant at the Conflab event (48 in total). In the directory numbered by participant ID, the following data are included:

1. raw audio file

2. proximity (bluetooth) pings (RSSI) file (raw and csv) and a visualization

3. Tri-axial accelerometer data (raw and csv) and a visualization

4. Tri-axial gyroscope data (raw and csv) and a visualization

5. Tri-axial magnetometer data (raw and csv) and a visualization

6. Game rotation vector (raw and csv), recorded in quaternions.

All files are timestamped.

The sampling frequencies are:

- audio: 1250 Hz

- rest: around 50 Hz. However, the sample rate is not fixed and instead the timestamps should be used.
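
Because the sensor sample rate is not fixed, one common approach is to interpolate each signal onto a uniform time grid using its timestamps. The sketch below is one possible way to do that with linear interpolation; the function name, the assumption that timestamps are in seconds and strictly increasing, and the toy data are all hypothetical and not part of the dataset's tooling.

```python
def resample_uniform(timestamps, values, rate_hz):
    """Linearly interpolate irregularly timed samples onto a fixed-rate grid.

    Assumes timestamps are in seconds and strictly increasing.
    """
    step = 1.0 / rate_hz
    count = int((timestamps[-1] - timestamps[0]) / step + 1e-9) + 1
    grid = [timestamps[0] + i * step for i in range(count)]
    resampled = []
    j = 0
    for t in grid:
        # Advance to the pair of original samples that brackets time t.
        while j + 1 < len(timestamps) - 1 and timestamps[j + 1] < t:
            j += 1
        t0, t1 = timestamps[j], timestamps[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(values[j] + weight * (values[j + 1] - values[j]))
    return grid, resampled


# Toy example: samples at roughly, but not exactly, 50 Hz.
timestamps = [0.000, 0.021, 0.039, 0.060, 0.081, 0.100]
values = [2.0 * t for t in timestamps]
grid, resampled = resample_uniform(timestamps, values, rate_hz=50.0)
print(len(grid))
```

Linear interpolation is only one choice; for signals such as quaternions a different interpolation scheme may be more appropriate.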

For rotation, the game rotation vector's output frequency is limited by the actual sampling frequency of the magnetometer. For more information, please refer to https://invensense.tdk.com/wp-content/uploads/2016/06/DS-000189-ICM-20948-v1.3.pdf

Audio files in this folder are in raw binary form. The following can be used to convert them to WAV files (1250Hz):

ffmpeg -f s16le -ar 1250 -ac 1 -i /path/to/audio/file /path/to/output.wav
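
For anyone who prefers to stay in Python, the sketch below wraps raw samples in a WAV container using the standard-library `wave` module. It assumes the raw files are signed 16-bit little-endian mono PCM at 1250 Hz, matching the `-f s16le -ar 1250 -ac 1` flags in the command above; the function name and the four-sample test signal are hypothetical.

```python
import io
import struct
import wave


def raw_pcm_to_wav(raw_bytes, wav_file, sample_rate=1250):
    """Wrap raw signed 16-bit little-endian mono PCM in a WAV container."""
    with wave.open(wav_file, "wb") as wav:
        wav.setnchannels(1)      # mono, as in "-ac 1"
        wav.setsampwidth(2)      # 2 bytes per sample, as in "s16le"
        wav.setframerate(sample_rate)
        wav.writeframes(raw_bytes)


# Round-trip a tiny synthetic signal through an in-memory buffer.
raw = struct.pack("<4h", 0, 1000, -1000, 0)
buffer = io.BytesIO()
raw_pcm_to_wav(raw, buffer)
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    info = (wav.getnchannels(), wav.getsampwidth(),
            wav.getframerate(), wav.getnframes())
print(info)
```

For a real file, the raw bytes would be read with `open(path, "rb").read()` and the second argument would be an output path instead of a buffer.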

Synchronization of cameras and wearables data

Raw videos contain timecode information which matches the timestamps of the data in the "wearables" folder. The starting timecode of a video can be read as:

ffprobe -hide_banner -show_streams -i /path/to/video

./audio

./sync: contains wav files per each subject

./sync_files: auxiliary csv files used to sync the audio. Can be used to improve the synchronization.

The code used for syncing the audio can be found here:

https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/audio
Industrial Dataset
kaggle.com
Updated May 8, 2023
Cite
Be Schue (2023). Industrial Dataset [Dataset]. https://www.kaggle.com/datasets/beschue/industrial-classification-data-set
Dataset updated
May 8, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Be Schue
Description
The dataset includes 10 object categories from the MVTEC INDUSTRIAL 3D OBJECT DETECTION DATASET as input CAD objects. The selected objects include a diverse range of industrial products:

S.No Object Class
1 adapter plate triangular
2 bracket big
3 clamp small
4 engine part cooler round
5 engine part cooler square
6 injection pump
7 screw
8 star
9 tee connector
10 thread

The dataset contains a total of 100,000 RGB images (10,000 per object category), divided into three sets: 70,000 for training, 20,000 for testing, and 10,000 for validation. Each image has a resolution of 224 x 224 pixels and is in JPEG format.

To ensure the suitability of our dataset for various computer vision tasks, we included not only the class labels but also generated bounding boxes and semantic masks for each image, which are stored in COCO annotation format. Each image contains a single instance of one of the ten selected objects.
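As a rough illustration of how COCO-style annotations can be read, the sketch below groups bounding boxes by image and converts them from COCO's [x, y, width, height] layout to corner coordinates. The file name, IDs and box values are made up for the example, and the function name is hypothetical. The actual annotation files in this dataset may carry additional fields.

```python
from collections import defaultdict

def boxes_by_image(coco):
    """Map each image file name to a list of (class name, (x1, y1, x2, y2))."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {img["id"]: img["file_name"] for img in coco["images"]}
    result = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x_min, y_min, width, height]
        corners = (x, y, x + w, y + h)
        result[files[ann["image_id"]]].append((names[ann["category_id"]], corners))
    return dict(result)

# Minimal made-up annotation dictionary in COCO layout.
coco = {
    "images": [{"id": 1, "file_name": "screw_000001.jpg", "width": 224, "height": 224}],
    "categories": [{"id": 7, "name": "screw"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 7,
         "bbox": [60, 80, 50, 40], "area": 2000, "iscrowd": 0},
    ],
}
boxes = boxes_by_image(coco)
```

In practice the dictionary would be loaded from the annotation JSON file with `json.load`.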

Across the 10,000 images of each class, we randomly varied the object's position along the x, y, and z axes and its rotation to provide a diverse range of images. Additionally, we changed the object's surface to a smooth metallic texture, imitating real industrial components. Lastly, we varied the lighting conditions within each image, including the position of the light sources, their energy, and emission strength.

Find out more about our Data Generation Tool:

Schuerrle, B., Sankarappan, V., & Morozov, A. (2023). SynthiCAD: Generation of Industrial Image Data Sets for Resilience Evaluation of Safety-Critical Classifiers. In Proceeding of the 33rd European Safety and Reliability Conference. 33rd European Safety and Reliability Conference. Research Publishing Services. https://doi.org/10.3850/978-981-18-8071-1_p400-cd
Dataset relating a study on Geospatial Open Data usage and metadata quality
zenodo.org
data.niaid.nih.gov
Updated Jun 19, 2023
Cite
Alfonso Quarati; Monica De Martino (2023). Dataset relating a study on Geospatial Open Data usage and metadata quality [Dataset]. http://doi.org/10.5281/zenodo.4280594
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4280594
Dataset updated
Jun 19, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alfonso Quarati; Monica De Martino
Description
Open Government Data (OGD) portals host thousands of geo-referenced datasets containing spatial information, which makes them of great interest for any analysis or process relating to the territory. To realize this potential, users must be able to access and reuse these datasets. One factor often considered to hinder the full dissemination of OGD data is the quality of its metadata. Starting from an experimental investigation of over 160,000 geospatial datasets belonging to six national and international OGD portals, this work first provides an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between the two variables was measured. The results showed a significant underutilization of geospatial datasets and generally poor metadata quality. In addition, only a weak correlation was found between usage and metadata quality, too weak to assert with certainty that the latter is a determining factor of the former.

The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).

Data collection occurred in the period: 2019-12-19 -- 2019-12-23.

The header for each CSV file is:

[ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]

where, for each row (one dataset from a portal), the fields are defined as follows:

portalid: portal identifier

id: dataset identifier

downloaddate: date of data collection

metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema

overallq: overall quality values computed by applying the methodology presented in [1]

qvalues: json object containing the quality values computed for the 17 metrics presented in [1]

assessdate: date of quality assessment

dviews: number of total views for the dataset

downloads: number of total downloads for the dataset (made available only by the Colombia, HDX, and NASA portals)

engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)

admindomain: 1 (national), 2 (international)
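The field list above can be turned into a small loader. The sketch below uses only the standard library and a tiny in-memory file with made-up values. It assumes that `overallq` parses as a float, that `dviews` and `downloads` are integers, that `downloads` is empty where a portal does not publish it, and that `qvalues` holds a JSON object. The metric name `metric_a` and the function name are hypothetical.

```python
import csv
import io
import json

FIELDS = ["", "portalid", "id", "downloaddate", "metadata", "overallq", "qvalues",
          "assessdate", "dviews", "downloads", "engine", "admindomain"]

ENGINES = {"1": "CKAN", "2": "Socrata"}
DOMAINS = {"1": "national", "2": "international"}

def load_rows(csv_file):
    """Parse rows, decoding numeric fields and the JSON-encoded qvalues column."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "portalid": row["portalid"],
            "id": row["id"],
            "overallq": float(row["overallq"]),
            "qvalues": json.loads(row["qvalues"]),
            "dviews": int(row["dviews"]),
            # Only some portals publish downloads, so the field may be empty.
            "downloads": int(row["downloads"]) if row["downloads"] else None,
            "engine": ENGINES[row["engine"]],
            "admindomain": DOMAINS[row["admindomain"]],
        })
    return rows

# Build a tiny synthetic file in memory; every value below is made up.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerow([0, "portalA", "ds-1", "2019-12-19", "{}", 0.62,
                 json.dumps({"metric_a": 0.5}), "2019-12-20", 120, "", 1, 1])
writer.writerow([1, "portalB", "ds-2", "2019-12-21", "{}", 0.48,
                 json.dumps({"metric_a": 0.9}), "2019-12-22", 15, 3, 2, 2])
buffer.seek(0)
rows = load_rows(buffer)
```

With the real files, `csv_file` would be an open handle on one of the unzipped CSVs, and the `metadata` column could be decoded separately according to the platform schema.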

[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
Accessibility Destination Datasets
data.europa.eu
Updated Aug 18, 2011
Cite
Department for Transport (2011). Accessibility Destination Datasets [Dataset]. https://data.europa.eu/data/datasets/accessibility-destination-datasets?locale=en
Dataset updated
Aug 18, 2011
Dataset authored and provided by
Department for Transport
License
http://reference.data.gov.uk/id/open-government-licence
Description
Excel datasets containing raw destination data for calculating Accessibility statistics. This gives the locations of the different services used within these calculations: Primary schools, Secondary Schools, Further Education, Hospitals, GPs, Town Centres, Employment Centres.

The Food Stores data, and the 2010 GP and Hospitals data used in the accessibility statistics calculations, come from a commercial dataset and cannot be made available for reuse.
ImageNet-Sketch Dataset
paperswithcode.com
Updated Oct 23, 2022
Cite
Haohan Wang; Songwei Ge; Eric P. Xing; Zachary C. Lipton (2022). ImageNet-Sketch Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet-sketch
Dataset updated
Oct 23, 2022
Authors
Haohan Wang; Songwei Ge; Eric P. Xing; Zachary C. Lipton
Description
The ImageNet-Sketch data set consists of 50,889 images, approximately 50 images for each of the 1000 ImageNet classes. The data set is constructed with Google Image queries of the form "sketch of <class name>", where <class name> is the standard class name. Searches were restricted to the "black and white" color scheme. 100 images are initially queried for every class, and the pulled images are cleaned by deleting irrelevant images and images that belong to similar but different classes. For some classes, fewer than 50 images remain after manual cleaning, and the data set is then augmented by flipping and rotating the images.
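Flip-and-rotate augmentation of the kind mentioned above can be sketched on a tiny grid of pixel values. The description does not state which flips or rotation angles were used, so the horizontal flip and the 90-degree clockwise rotation below are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def rotate_90_clockwise(image):
    """Rotate an image (a list of pixel rows) by 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise turn.
    return [list(row) for row in zip(*image[::-1])]

# A made-up 2 x 3 "image" of pixel values.
image = [[1, 2, 3],
         [4, 5, 6]]
flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
```

Real images would normally be handled with an imaging library, but the index manipulation is the same.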
Amazon review data 2018
cseweb.ucsd.edu
nijianmo.github.io
Cite
UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
Dataset authored and provided by
UCSD CSE Research Project
Description
Context

This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

More reviews: The total number of reviews is 233.1 million (142.8 million in 2014).

New reviews: Current data includes reviews in the range May 1996 - Oct 2018.

Metadata: We have added transaction metadata for each review shown on the review page, and more detailed metadata from the product landing page.

Acknowledgements

If you publish articles based on this dataset, please cite the following paper:

Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.

Height Statistics (inches)	Boys	Girls
Mean	67	62
Standard Deviation	2.9	2.2

