MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A realistic, large-scale synthetic dataset of 10,000 students designed to analyze factors affecting college placements.
This dataset simulates the academic and professional profiles of 10,000 college students, focusing on factors that influence placement outcomes. It includes features like IQ, academic performance, CGPA, internships, communication skills, and more.
The dataset is ideal for:
Column Name | Description |
---|---|
College_ID | Unique ID of the college (e.g., CLG0001 to CLG0100) |
IQ | Student’s IQ score (normally distributed around 100) |
Prev_Sem_Result | GPA from the previous semester (range: 5.0 to 10.0) |
CGPA | Cumulative Grade Point Average (range: ~5.0 to 10.0) |
Academic_Performance | Annual academic rating (scale: 1 to 10) |
Internship_Experience | Whether the student has completed any internship (Yes/No) |
Extra_Curricular_Score | Involvement in extracurriculars (score from 0 to 10) |
Communication_Skills | Soft skill rating (scale: 1 to 10) |
Projects_Completed | Number of academic/technical projects completed (0 to 5) |
Placement | Final placement result (Yes = Placed, No = Not Placed) |
This dataset was generated to resemble real-world data in academic institutions for research and machine learning use. While it is synthetic, the variables and relationships are crafted to mimic authentic trends observed in student placements.
MIT
Created using Python (NumPy, Pandas) with data logic designed for educational and ML experimentation purposes.
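A minimal usage sketch (not part of the dataset's own tooling), assuming the table is shipped as a single CSV containing the columns listed above; the file name is a placeholder:

# Minimal sketch: predict Placement from the columns described above.
# The CSV file name is hypothetical; adjust it to the actual file.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("college_student_placement.csv")  # placeholder file name

# Encode the Yes/No columns as 0/1.
for col in ["Internship_Experience", "Placement"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

X = df.drop(columns=["College_ID", "Placement"])
y = df["Placement"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))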
The dataset contains the listed points (on the ground or on buildings) extracted from the Regional Numerical Technical Map (CTRN) at the 1:10,000 scale, acquired by the Map Service of the Piedmont Region from aerial surveys flown between 1991 and 2005. The data can be downloaded according to the sheet cut at the 1:50,000 scale.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Please note that this is the original dataset with additional information and proper attribution. There is at least one other version of this dataset on Kaggle that was uploaded without permission. Please be fair and attribute the original author. This synthetic dataset is modeled after an existing milling machine and consists of 10,000 data points, stored as rows with 14 features in columns.
The machine failure consists of five independent failure modes:
1. tool wear failure (TWF): the tool is replaced or fails at a randomly selected tool wear time between 200 and 240 min (120 times in our dataset). At this point in time, the tool is replaced 69 times and fails 51 times (randomly assigned).
2. heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1,380 rpm. This is the case for 115 data points.
3. power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3,500 W or above 9,000 W, the process fails, which is the case 95 times in our dataset.
4. overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 data points.
5. random failures (RNF): each process has a 0.1 % chance of failing regardless of its process parameters. This is the case for only 5 data points, fewer than would be expected for the 10,000 data points in our dataset.
If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes caused the process to fail.
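The failure rules above are simple enough to express directly in code. The following is a sketch of how the five modes combine into the binary 'machine failure' label; it is not the original data generator, and the function name and the explicit tool-wear replacement time argument are illustrative:

# Sketch of the five failure rules; not the original generator.
import math
import random

def machine_failure(product_type, air_temp_K, process_temp_K,
                    rotational_speed_rpm, torque_Nm, tool_wear_min,
                    twf_replace_time_min):
    # TWF: tool replaced or failed at a randomly chosen wear time (200-240 min).
    twf = tool_wear_min >= twf_replace_time_min
    # HDF: small air/process temperature difference at low rotational speed.
    hdf = (process_temp_K - air_temp_K) < 8.6 and rotational_speed_rpm < 1380
    # PWF: power = torque * rotational speed (rad/s); failure outside 3500-9000 W.
    power_W = torque_Nm * rotational_speed_rpm * 2 * math.pi / 60
    pwf = power_W < 3500 or power_W > 9000
    # OSF: tool wear * torque above the variant-specific overstrain limit.
    osf_limit = {"L": 11000, "M": 12000, "H": 13000}[product_type]
    osf = tool_wear_min * torque_Nm > osf_limit
    # RNF: 0.1 % random failure chance, independent of the process parameters.
    rnf = random.random() < 0.001
    return int(twf or hdf or pwf or osf or rnf)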
This dataset is part of the following publication, please cite when using this dataset: S. Matzka, "Explainable Artificial Intelligence for Predictive Maintenance Applications," 2020 Third International Conference on Artificial Intelligence for Industries (AI4I), 2020, pp. 69-74, doi: 10.1109/AI4I49448.2020.00023.
The image of the milling process is the work of Daniel Smyth @ Pexels: https://www.pexels.com/de-de/foto/industrie-herstellung-maschine-werkzeug-10406128/
Digital surfaces and thicknesses of selected hydrogeologic units of the Floridan aquifer system were developed to define an updated hydrogeologic framework as part of the U.S. Geological Survey Groundwater Resources Program. This feature class contains data points used to generate the est_10000_TDS raster. It also includes "control" points used to map the 10,000 boundary, including time-domain electromagnetic soundings; the data source is written communication from Pat Burger, St. Johns River Water Management District, 2013, and other sources.
Gravity data measure small changes in gravity due to changes in the density of rocks beneath the Earth's surface. The data collected are processed via standard methods to ensure the response recorded is due only to the rocks in the ground. The results produce datasets that can be interpreted to reveal the geological structure of the sub-surface. The processed data are checked for quality by GA geophysicists to ensure that the final data released by GA are fit for purpose. This Texas Gravity Data (P199841) contains a total of 2,529 point data values acquired at a spacing between 2,000 and 10,000 metres. The data are located in QLD and were acquired in 1998 under project No. 199841 for the Geological Survey of Queensland (GSQ).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This study provides data from the World Bank's PovcalNet on the distribution of household income and consumption across populations for 942 country-years, organized in dta and csv files by region. Each distribution contains 10,000 data points, one for each 0.01 incremental increase in percent of people living in households at or below a given income or consumption level. In addition, a data set containing the estimated parameters of the Beta and General Quadratic Lorenz curves is provided. For reference, we also provide the Python scripts used to query the PovcalNet online tool and export data from the Mongo database used to store results of these queries, along with all do files used to clean and construct the final data sets and summary statistics.
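One straightforward use of the 10,000-point distributions is to compute inequality statistics directly. The sketch below estimates a Gini coefficient from one country-year file; the file and column names are placeholders, and the rows are assumed to be ordered by the 0.01-percent population increments described above:

import numpy as np
import pandas as pd

df = pd.read_csv("distribution_one_country_year.csv")  # placeholder file name
p = np.arange(1, len(df) + 1) / len(df)                # cumulative population share
income = df["income_or_consumption"].to_numpy()        # placeholder column name

# Lorenz curve: cumulative share of total income up to each population share.
lorenz = np.cumsum(income) / income.sum()
# Gini = 1 - 2 * (area under the Lorenz curve), via the trapezoidal rule.
gini = 1 - 2 * np.trapz(lorenz, p)
print(f"Gini coefficient: {gini:.3f}")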
This dataset provides a comprehensive list of OLD and NEW car prices in the market, with information on various factors such as car make, year, model, transmission type, and more. With over 10,000 data points, this dataset allows for in-depth analysis and exploration of the dynamics of car prices in the market, making it a valuable resource for researchers, analysts, and car enthusiasts alike.
Here you will find 78,612 records about used cars, covering 60 distinct Brand, 382 Model, 33 Modelyear, 1,839 CarModel, 1,397 AveragePrice, 893 MinimumPrice, and 916 MaximumPrice values, over 128 Months/Years.
Here you will find 3,433 records about new cars: 1,119 OldPrice, 410 ChangValue, and 1,162 NewPrice values with 268 ChangeDate entries, across 49 Brand and 178 Model values, over 4 Years.
1- Price Prediction: The dataset contains information about various car models, such as their brand, model, year, fuel type, and transmission. This information can be used to predict the price of a car using regression models (a minimal sketch follows this list).
2- Brand Analysis: The dataset contains information about the brand of each car. You can analyze the dataset to see which brand has the highest average price.
3- Transmission Analysis: You can analyze the dataset to see how the price of a car varies with transmission type. For example, you can see if cars with automatic transmissions have a higher or lower price than cars with manual transmissions.
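As a rough illustration of the price-prediction use case (point 1), the sketch below fits a baseline regression on the used-car table; the file name is a placeholder, and only columns named in the description above (Brand, Model, Modelyear, AveragePrice) are used:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")        # placeholder file name
X = df[["Brand", "Model", "Modelyear"]]  # assumed feature columns
y = df["AveragePrice"]                   # assumed target column

# One-hot encode the categorical columns; pass Modelyear through as numeric.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand", "Model"])],
    remainder="passthrough",
)
model = make_pipeline(pre, Ridge(alpha=1.0))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))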
Pre-trained embeddings for approximate nearest neighbor search using the cosine distance. This dataset consists of two splits:
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the deep1b embeddings.
ds = tfds.load('deep1b', split='train')

# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is one of a number of datasets containing geomorphological data relating to the Windmill Islands, Wilkes Land, Antarctica. The dataset comprises a digital point coverage which is linked to a separate digital database (i.e. attribute tables) in which attributes are assigned to topographic profiles and transects and to the respective samples represented along these profiles. The coverage has been built for lines and points, and the attribute tables profile.aat and profile.pat are assigned the following items respectively:
profile.aat: profile_name, descript, descript1, descript2, descript3
profile.pat: profile_name, site, s_elev, br_elev, s_elev_source, br_elev_source, s_elev_qual, br_elev_qual
Does not conform to Geoscience Australia's Data Dictionary, as it is too detailed.
These data were compiled by Dr Ian D Goodwin from his own field notes and from the records of other workers. See the linked document at the URL below for further information.
This synthetic dataset is modeled after an existing milling machine and consists of 10,000 data points stored as rows with 14 features in columns.
Pre-trained embeddings for approximate nearest neighbor search using the Euclidean distance. This dataset consists of two splits:
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the sift1m embeddings.
ds = tfds.load('sift1m', split='train')

# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is used for training the TRENDY method for gene regulatory network inference. It also contains the SINC test data set.
For a brief description of the code for the TRENDY method, see https://github.com/YueWangMathbio/TRENDY.
See https://github.com/YueWangMathbio/TRENDY/blob/main/GRN_transformer.pdf for the manuscript describing the TRENDY method.
To use the data:
1. Download all files from https://github.com/YueWangMathbio/TRENDY.
2. Download all files from this database (both https://zenodo.org/records/14927741 and https://zenodo.org/records/13929908).
3. In the folder with all files from GitHub, create a folder named "total_data_10", and unzip all files named "dataset....zip" into this folder.
4. Unzip "rev_wendy_all_10.zip" in the folder with all files from GitHub.
5. Unzip "SINC_data.zip", and put the extracted files into the folder "SINC".
The "total_data_10" folder will contain 102 groups of data, where each group has eight files with different name endings:
xxx_A: 1000 ground truth gene regulatory networks, each of size 10*10
xxx_cov: 11000 covariance matrices for 1000 samples at 11 time points, each of size 10*10
xxx_data: 1000 gene expression samples, each of size 100*10*11 (100 cells, 10 genes, 11 time points)
xxx_genie: 10000 inferred gene regulatory networks by GENIE3 method for 1000 samples at 10 time points, each of size 10*10
xxx_nlode: 1000 inferred gene regulatory networks by NonlinearODEs method for 1000 samples, each of size 10*10
xxx_revcov: 10000 constructed pseudo covariance matrices for 1000 samples at 10 time points, each of size 10*10
xxx_sinc: 1000 inferred gene regulatory networks by SINCERITIES method for 1000 samples, each of size 10*10
xxx_wendy: 10000 inferred gene regulatory networks by WENDY method for 1000 samples at 10 time points, each of size 10*10
The "rev_wendy_all_10" folder will contain two groups of data, where each group has eight files with different name endings:
xxx_ktstar: 10000 inferred covariance matrices by the first half of TRENDY for 1000 samples at 10 time points, each of size 10*10
xxx_revwendy: 10000 inferred gene regulatory networks by the first half of TRENDY for 1000 samples at 10 time points, each of size 10*10
The first 100 groups, identified by number, are for training. The group labeled "val" is for validation. The group labeled "test" is for testing.
If you want to train or test new GRN inference methods, just use the xxx_A and xxx_data files; a minimal loading sketch follows.
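A minimal loading sketch, assuming the unzipped files are NumPy arrays saved with numpy.save (check the TRENDY repository for the actual file format and names; the group name below is a placeholder):

import numpy as np

A = np.load("total_data_10/group001_A.npy")        # placeholder name; (1000, 10, 10) ground-truth GRNs
data = np.load("total_data_10/group001_data.npy")  # placeholder name; (1000, 100, 10, 11) expression samples

# One training pair: expression trajectories of sample 0 and its true network.
expr_0, A_0 = data[0], A[0]
print(expr_0.shape, A_0.shape)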
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Navier-Stokes Simulated Flow Dataset for PINNs
Welcome to the Dataset!
Dive into the dynamic world of fluid flow with the Navier-Stokes Simulated Flow Dataset for PINNs! This collection of 10,000 simulated data points captures the essence of fluid dynamics in a 2D channel, tailored specifically for training Physics-Informed Neural Networks (PINNs). With an even split of 5,000 laminar flow and 5,000 turbulent flow points, this dataset is perfect for researchers, data… See the full description on the dataset page: https://huggingface.co/datasets/Allanatrix/CFD.
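A minimal loading sketch, assuming the data can be pulled with the Hugging Face datasets library from the page linked above; the split and column names are not documented here, so inspect the loaded object first:

from datasets import load_dataset

ds = load_dataset("Allanatrix/CFD")
print(ds)                    # available splits and columns
first_split = list(ds.keys())[0]
print(ds[first_split][0])    # inspect one simulated flow data point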
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This file contains metadata for 10,000 movies. The dataset consists of movies released on or before June 2025. Data points include movie title, TMDB id, original language, genres, release date, revenue, budget, runtime, and an overview of the movie.
This dataset consists of the following files (a loading sketch follows the list):
popular_movies.csv: Contains information about movies, i.e. title, tmdb id, original_language, genres, release date, revenue, budget, runtime, and overview.
credits.csv: Contains details about the cast of each movie and the crew members who worked on it.
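A minimal loading sketch for the two files; the join key is assumed to be a column named "tmdb_id" in both files, so check the actual headers before merging:

import pandas as pd

movies = pd.read_csv("popular_movies.csv")
credits = pd.read_csv("credits.csv")

# Attach cast and crew details to each movie by TMDB id (assumed column name).
combined = movies.merge(credits, on="tmdb_id", how="left")
print(combined.head())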
This dataset is an ensemble of data collected from TMDB. The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
Georeferenced vector database containing the geomorphological and anthropic elements, in point form, of the mountainous regional territory, surveyed at the 1:10,000 acquisition scale. The geographical area covered includes the regional Apennine territory.
Dataset Card for MedSynth
The MedSynth dataset contains synthetic medical dialogue–note pairs developed for the medical dialogue-to-note summarization task.
Dataset Details
Dataset Description
The dataset covers 2000 ICD-10 codes, with five data points per code, resulting in a total of more than 10,000 data points. The notes are in SOAP format.
Uses
MedSynth should not be used as a reliable source of medical information. It is intended solely to… See the full description on the dataset page: https://huggingface.co/datasets/Ahmad0067/MedSynth.
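A minimal loading sketch, assuming the dataset loads directly through the Hugging Face datasets library from the page linked above; the field names are not listed here, so print one record to see the dialogue and SOAP-note columns:

from datasets import load_dataset

medsynth = load_dataset("Ahmad0067/MedSynth")
print(medsynth)                     # available splits and their sizes
first_split = list(medsynth.keys())[0]
print(medsynth[first_split][0])     # one dialogue-note pair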
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset contains input-output data of a damped nonlinear pendulum that is actuated at the mounting point. The data were generated with statesim [1], a Python package for simulating linear and nonlinear ODEs, for the actuated pendulum system. The configuration .json files for the corresponding datasets (in-distribution and out-of-distribution) can be found in the respective folders. After creating the dataset, the files are stored in the raw folder. They are then split into subsets for training, testing, and validation, which can be found in the processed folder; details about the splitting are in the config.json file. The dataset can be used to test system identification algorithms and methods that aim to identify nonlinear dynamics from input-output measurements. The training dataset is used to optimize the model parameters, the validation set for hyperparameter optimization, and the test set only for the final evaluation. In [2], the authors used the same underlying dynamics to create their dataset, but without damping terms.
Input generation: Input trajectories are sampled from a multivariate normal distribution (a sketch follows below).
Noise: Gaussian white noise of approximately 30 dB is added at the output.
Statistics: The input and output size is one.
In-distribution data: 2,100,000 data points
- Training: 10,000 trajectories of length 150
- Validation: 2,000 trajectories of length 150
- Test: 2,000 trajectories of length 150
Out-of-distribution data: 7 x 100,000 data points
- 7 different datasets were used only for testing. Each dataset contains 200 trajectories of length 500.
References:
[1] Frank, D. statesim [Computer software]. https://github.com/Dany-L/statesim
[2] Lu, L., Jin, P., Pang, G., Zhang, Z., & Karniadakis, G. E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3), 218-229.
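The sketch below illustrates the two generation steps described above (input sampling and output noise); it is not the statesim code, and the covariance matrix and the placeholder output signal are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

# Input: one in-distribution trajectory of length 150 with scalar input.
# The identity covariance is a placeholder for the actual sampling distribution.
u = rng.multivariate_normal(mean=np.zeros(150), cov=np.eye(150))

# Placeholder for the simulated pendulum output driven by u.
y = np.sin(np.linspace(0.0, 10.0, 150))

# Add Gaussian white noise at roughly 30 dB signal-to-noise ratio.
snr_db = 30.0
noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
y_noisy = y + rng.normal(scale=np.sqrt(noise_power), size=y.shape)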
Gravity data measure small changes in gravity due to changes in the density of rocks beneath the Earth's surface. The data collected are processed via standard methods to ensure the response recorded is due only to the rocks in the ground. The results produce datasets that can be interpreted to reveal the geological structure of the sub-surface. The processed data are checked for quality by GA geophysicists to ensure that the final data released by GA are fit for purpose. This Gravity Survey (P198089) contains a total of 461 point data values acquired at a spacing between 450 and 10,000 metres. The data are located in SA and were acquired in 1980 under project No. 198089.
Access our data for free: https://matrix.blocksize.capital/auth/open/sign-up
The Blocksize 30-Minute VWAP Feed provides precise, time-anchored pricing snapshots for digital assets, updated every 30 minutes around the clock. Designed for use cases where regular and unbiased price reference points are essential — such as portfolio valuation, fund NAV calculation, settlement, or compliance reporting — this feed offers volume-weighted average prices based on executed trades across a broad and continuously vetted set of exchanges.
Each pricing point is calculated using trade data observed during the 30-minute interval immediately preceding each half-hour mark (e.g., 00:30, 01:00, 01:30 UTC, etc.). For each interval, the final price is derived from the volume-weighted average of the last trade events on all reporting exchanges. This method ensures that higher-volume trades contribute more significantly to the resulting price, offering a fair and liquidity-sensitive reflection of market value.
To ensure accuracy and data integrity, only validated trade events with complete volume, price, and timestamp information are considered. Any incomplete, malformed, or delayed exchange data is automatically excluded from the calculation. In the rare event that no valid data is available for a given interval, the feed defaults to the last available valid price to preserve pricing continuity — a critical feature for settlement systems and automated pipelines.
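As a rough illustration of the interval logic described above (not Blocksize's implementation), a single 30-minute VWAP with the last-valid-price fallback might look like this; the trade records are assumed to be already validated and filtered to the interval:

def interval_vwap(trades, last_valid_price):
    # trades: list of (price, volume) pairs observed during the 30-minute window.
    total_volume = sum(volume for _, volume in trades)
    if total_volume == 0:
        # No valid trades in this interval: fall back to preserve pricing continuity.
        return last_valid_price
    return sum(price * volume for price, volume in trades) / total_volume

# Example: three trades observed in the 30 minutes before 01:00 UTC.
print(interval_vwap([(101.0, 2.0), (100.5, 1.0), (99.8, 0.5)], last_valid_price=100.2))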
The feed also benefits from active oversight and quality assurance by Blocksize’s internal data committee. Exchanges that show recurring anomalies or inconsistencies are removed from the input set until verified corrections are made, while new sources are added only after rigorous integrity checks. This combination of automation, governance, and data hygiene ensures that the 30-minute VWAP feed remains a trusted pricing oracle for digital asset markets, even during volatile or low-liquidity periods.
Questions? Reach out to our qualified data team.
PII Statement: Our datasets do not include personal, pseudonymized, or sensitive user data.