Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The Natural Questions (NQ) dataset is a comprehensive collection of real user queries submitted to Google Search, with answers sourced from Wikipedia by expert annotators. Created by Google AI Research, this dataset aims to support the development and evaluation of advanced automated question-answering systems. The version provided here includes 89,312 meticulously annotated entries, tailored for ease of access and utility in natural language processing (NLP) and machine learning (ML) research.
The dataset is composed of authentic search queries from Google Search, reflecting the wide range of information sought by users globally. This approach ensures a realistic and diverse set of questions for NLP applications.
The NQ dataset underwent significant pre-processing to prepare it for NLP tasks: - Removal of web-specific elements like URLs, hashtags, user mentions, and special characters using Python's "BeautifulSoup" and "regex" libraries. - Grammatical error identification and correction using the "LanguageTool" library, an open-source grammar, style, and spell checker.
These steps were taken to clean and simplify the text while retaining the essence of the questions and their answers, divided into 'questions', 'long answers', and 'short answers'.
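A minimal sketch of this kind of cleaning (not the exact pipeline used for this dataset) for a single text string, assuming BeautifulSoup and regex as described above; grammar correction could then be applied with the LanguageTool library:

from bs4 import BeautifulSoup
import re

def clean_text(text):
    # Strip any embedded HTML markup
    text = BeautifulSoup(text, "html.parser").get_text(" ")
    # Remove URLs, hashtags, and user mentions
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    # Drop remaining special characters and collapse whitespace
    text = re.sub(r"[^A-Za-z0-9.,;:?!'\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()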
The unprocessed data, including answers with embedded HTML, empty or complex long and short answers, is stored in "Natural-Questions-Base.csv". This version retains the raw structure of the data, featuring HTML elements in answers, and varied answer formats such as tables and lists, providing a comprehensive view for those interested in the original dataset's complexity and richness. The processed data is compiled into a single CSV file named "Natural-Questions-Filtered.csv". The file is structured for easy access and analysis, with each record containing the processed question, a detailed answer, and concise answer snippets.
The filtered version is available where specific criteria, such as question length or answer complexity, were applied to refine the data further. This version allows for more focused research and application development.
The repository at 'https://github.com/fujoos/natural_questions' also includes a Flask-based CSV reader application designed to read and display the contents of the "NaturalQuestions.csv" file. The app provides functionalities such as: - Viewing questions and answers directly in your browser. - Filtering results based on criteria like question keywords or answer length. - See the live demo, which uses the CSV files converted to a SQLite database, at 'https://fujoos.pythonanywhere.com/'
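A minimal sketch of a Flask-based CSV viewer of this kind (illustrative only, not the repository's actual code):

from flask import Flask
import pandas as pd

app = Flask(__name__)

@app.route("/")
def show_questions():
    # The repository's app reads "NaturalQuestions.csv"; path may differ locally
    df = pd.read_csv("NaturalQuestions.csv")
    # Render the first rows as a simple HTML table in the browser
    return df.head(50).to_html(index=False)

if __name__ == "__main__":
    app.run(debug=True)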
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
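For example, the stress-strain history can be inspected directly from these two columns (a minimal sketch; matplotlib and an illustrative filename are assumed):

import pandas
import matplotlib.pyplot as plt

# Illustrative filename; substitute the path to one of the downsampled data files
data = pandas.read_csv("LP_example.csv", index_col=0)
plt.plot(data["e_true"], data["Sigma_true"])
plt.xlabel("True strain (e_true)")
plt.ylabel("True stress (Sigma_true)")
plt.show()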
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version are placeholders identifying the particular summary file
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                   keep_default_na=False, na_values='')
Caveats
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
This dataset captures FC Barcelona's journey in the 2024–2025 UEFA Champions League, with detailed statistics scraped from FBref.com. It includes comprehensive match and player-level data covering all major performance areas such as passing, shooting, defending, goalkeeping, and more.
The data was collected using Python and Playwright and organized into clean CSV files for easy analysis. The scraping code is available on GitHub.
The dataset was collected by scraping FBref’s publicly available tables using Python and Playwright. The following tables were extracted:
| Filename | Description |
|---|---|
| Standard_Stats_2024-2025_Barcelona_Champions_League.csv | Basic stats per player (games, goals, assists, etc.) |
| Scores_and_Fixtures_2024-2025_Barcelona_Champions_League.csv | Match results, dates, formation, etc. |
| Goalkeeping_2024-2025_Barcelona_Champions_League.csv | Goals against, saves, wins, losses, etc. |
| Advanced_Goalkeeping_2024-2025_Barcelona_Champions_League.csv | Goals against, post-shot expected goals, throws attempted, etc. |
| Shooting_2024-2025_Barcelona_Champions_League.csv | Shot types, goals, penalty kicks, etc. |
| Passing_2024-2025_Barcelona_Champions_League.csv | Total passes, pass distance, key passes, assists, expected assists, etc. |
| Pass_Types_2024-2025_Barcelona_Champions_League.csv | Pass types, crosses, switches, etc. |
| Goal_and_Shot_Creation_2024-2025_Barcelona_Champions_League.csv | Shot-creating actions, goal-creating actions, etc. |
| Defensive_Actions_2024-2025_Barcelona_Champions_League.csv | Tackles, dribbles, etc. |
| Possession_2024-2025_Barcelona_Champions_League.csv | Ball touches, carries, take-ons, etc. |
| Playing_Time_2024-2025_Barcelona_Champions_League.csv | Minutes, starts, substitutions, etc. |
| Miscellaneous_Stats_2024-2025_Barcelona_Champions_League.csv | Fouls, cards, offsides, aerials won/lost, etc. |
| League_phase,_Champions_League.csv | Champions League group/phase info |
📌 Note: This dataset is not fully cleaned. It contains missing values (NaN). In addition, multiple tables need to be merged to get a complete picture of each player's performance. This makes the dataset a great opportunity for beginners to practice data cleaning, handling missing data, and combining related datasets for analysis.
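A minimal sketch of the merging and missing-value handling described in the note above (the join column "Player" is an assumption; check the actual headers in each file):

import pandas as pd

std = pd.read_csv("Standard_Stats_2024-2025_Barcelona_Champions_League.csv")
sho = pd.read_csv("Shooting_2024-2025_Barcelona_Champions_League.csv")
# Assumed shared player-name column; adjust to the real column names
merged = std.merge(sho, on="Player", how="left", suffixes=("", "_shooting"))
# Basic missing-value handling before analysis
merged = merged.dropna(subset=["Player"]).fillna(0)
print(merged.head())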
Terms: https://crawlfeeds.com/privacy_policy
Looking for reliable and actionable data from the Vint Marketplace? Our expertly extracted dataset is just what you need. With over 20,000 records in CSV format, this dataset is tailored to meet the needs of analysts, researchers, and businesses looking to gain valuable insights into the thriving marketplace for fine wines and spirits.
We understand the value of quality data in driving decisions. This 20k-record CSV dataset is meticulously compiled to provide structured and accessible information for your specific requirements. Whether you're conducting market research or building an e-commerce platform, this dataset offers the granular detail you need.
Unlock the potential of fine wine data with our Vint Marketplace CSV dataset. With its organized format and extensive records, it’s the perfect resource to elevate your projects. Contact us now to access the dataset and take the next step in data-driven decision-making.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.
The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.
The full set of files, in order of use, is as follows:
- Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
- 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
- URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
- 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
- scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
- HTML.zip: Archived version of the set of individual HTML files.
- 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
- TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
- input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
- 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
- categorization_applied.csv: Output file containing the classification of selected pages.
- exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
- 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
- metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
- TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
- TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
- 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
- TM_125.RData: RData file containing the results of the 125-topic model.
- loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
- 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
- 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
- Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
- 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
- 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
- URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.
For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.
The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
This dataset is a clean CSV file with the most recent estimates of the populations of countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/
The data was generated by web scraping the aforementioned link on 16 August 2021. Below is the code used to produce the CSV in Python 3.8:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the population-by-country page
url = "https://www.worldometers.info/world-population/population-by-country/"
r = requests.get(url)

# Parse the HTML and grab the first (and only) data table
soup = BeautifulSoup(r.content)
countries = soup.find_all("table")[0]

# Let pandas convert the HTML table into a DataFrame and save it as CSV
dataframe = pd.read_html(str(countries))[0]
dataframe.to_csv("countries_by_population_2021.csv", index=False)
The creation of this dataset would not be possible without the team at Worldometer, a data aggregation website.
This dataset is a refined version of Alpine 1.0. It was created by generating tasks using various LLMs, wrapping them in special markers {Instruction Start} ... {Instruction End}, and saving them in a text file. We then processed this file with a Python script that used regex to extract the tasks into a CSV. Afterward, we cleaned the dataset by removing near-duplicates, vague prompts, and ambiguous entries:
python clean.py -i prompts.csv -o cleaned.csv -p "prompt" -t 0.92 -l 30
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed.
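A minimal sketch of the regex extraction step (illustrative only, not the actual script; the filenames are hypothetical):

import csv
import re

with open("tasks.txt", encoding="utf-8") as f:
    raw = f.read()

# Pull out everything between the {Instruction Start} ... {Instruction End} markers
tasks = re.findall(r"\{Instruction Start\}(.*?)\{Instruction End\}", raw, flags=re.DOTALL)

with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])
    for task in tasks:
        writer.writerow([task.strip()])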
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset comprises spatial and temporal data related to our analysis on blue and green water consumption (WC) of global crop production in high spatial resolution (5 arc-minutes – approximately 10 km at the equator) for the years 2020, 2010 and 2000.
Modelling water consumption of SPAM data
We use SPAM (Spatial Production Allocation Model) data, released by the International Food Policy Research Institute (IFPRI). We use SPAM2020 data for the year 2020 (46 crops), SPAM2010 data for the year 2010 (42 crops) and SPAM2000 data for the year 2000 (20 crops).
We develop a Python-based global gridded crop green and blue WC assessment tool, entitled CropGBWater. Operating on a daily time scale, CropGBWater dynamically simulates the rootzone water balance and related fluxes. We provide this model open access as Data_S10.
SPAM2020 crop data are modelled for the years 2018-2022, SPAM2010 crop data for the years 2008-2012 and SPAM2000 crop data for the years 1998-2002. We compute WCbl (blue WC) and WCgn (green WC), with components WCgn,irr (green WC of irrigated area) and WCgn,rf (green WC of rainfed area).
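The model itself is provided as Data_S10. Purely as a toy illustration of how green and blue WC can be separated in a daily bucket-type rootzone balance (not the CropGBWater implementation), one might write:

def daily_water_balance(rain, et_demand, soil_capacity, irrigated=True):
    # Toy bucket model: track soil moisture day by day and split ET into
    # green (rain/soil-fed) and blue (irrigation-fed) water consumption, all in mm.
    soil = soil_capacity  # start with a full rootzone store
    wc_green, wc_blue = 0.0, 0.0
    for p, et in zip(rain, et_demand):
        soil = min(soil + p, soil_capacity)   # infiltration, excess drains off
        green_supply = min(et, soil)          # ET met from rain/soil moisture
        soil -= green_supply
        deficit = et - green_supply
        if irrigated:
            wc_blue += deficit                # irrigation fills the remaining demand
        wc_green += green_supply
    return wc_green, wc_blue

print(daily_water_balance([2.0] * 10, [5.0] * 10, soil_capacity=10.0))  # (green, blue) totals in mm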
File description:
The dataset consists of the following files:
Please only use the latest version of this Zenodo repository.
Publication:
For all details, please refer to the open access paper:
Chukalla, A.D., Mekonnen, M.M., Gunathilake, D., Wolkeba, F.T., Gunasekara, B., Vanham, D. (2025) Global spatially explicit crop water consumption shows an overall increase of 9% for 46 agricultural crops from 2010 to 2020, Nature Food, Volume 6, https://doi.org/10.1038/s43016-025-01231-x
Funding:
This research, led by IWMI, a CGIAR centre, was carried out under the CGIAR Initiative on Foresight (www.cgiar.org/initiative/foresight/) as well as the CGIAR “Policy innovations” Science Program (www.cgiar.org/cgiar-research-porfolio-2025-2030/policy-innovations). The authors would like to thank all funders who supported this research through their contributions to the CGIAR Trust Fund (www.cgiar.org/funders).
This dataset was created by Martin Kanju
Released under Other (specified in description)
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document. Data were synthesized into comma-separated values, which include produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results from modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. The Jupyter Notebook containing the code used to clean the data can be found here.
Version 6 includes some additional cleaning and structuring of issues noticed after importing into Power BI. Changes were made by adding code to the Python notebook to export a newly cleaned dataset, such as adding a MonthNumber column for sorting by month, and similarly a WeekDayNumber column.
Cleaning was done in Python, with SQL Server also used to inspect the data quickly. Headers were added separately, ensuring no data loss. The data was cleaned of NaN and garbage values across the columns.
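A minimal sketch of adding such sort-helper columns with pandas (the filename and date column name are hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["OrderDate"])  # hypothetical file and date column
# Numeric helper columns so Power BI can sort month and weekday names correctly
df["MonthNumber"] = df["OrderDate"].dt.month
df["WeekDayNumber"] = df["OrderDate"].dt.dayofweek  # Monday = 0
df.to_csv("sales_cleaned.csv", index=False)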
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.
Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.
- train.csv: training data, with the Target column (0 = Clean, 1 = Fraud).
- test_share.csv: same structure as train.csv but without the Target column.
- Geo_scores.csv: location-based scores.
- Lambda_wts.csv: proprietary weights, matched by Group.
- Qset_tats.csv: network turn-around times (TAT).
- instance_scores.csv: vulnerability scores.
Suggested workflow: merge the extra files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching on id or Group, train and validate on train.csv (Target is ~1% fraud), then predict on test_share.csv or your own external data. Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
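A minimal starting point along these lines (the column names id, Group, and Target come from the description above; everything else is illustrative, not a reference solution):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")
geo = pd.read_csv("Geo_scores.csv")
# Merge one of the extra feature files on the shared id column (assumed key)
train = train.merge(geo, on="id", how="left")

X = train.drop(columns=["Target"]).select_dtypes("number").fillna(0)
y = train["Target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# class_weight="balanced" is one simple way to handle the ~1% fraud rate
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_val, clf.predict(X_val)))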
Tags: fraud-detection, classification, imbalanced-data, financial-transactions, machine-learning, python, beginner-friendly
License: CC BY-NC-SA 4.0
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains Starlink satellite data in both CSV and TLE formats. At the top level, it includes four files: one set representing a snapshot of all Starlink satellites at a specific time and another set representing a time-range dataset for STARLINK-1008 from March 11 to April 10, 2025. Additionally, there is a folder named STARLINK_INDIVIDUAL_SATELLITE_CSV_TLE_FILES_WITH_TIME_RANGE, which contains per-satellite data files in both CSV and TLE formats. These cover the time range from January 1, 2024, to June 6, 2025, for individual satellites. The number of files varies as satellites may have been launched at different times within this period.
This dataset contains processed CSV versions of Starlink satellite data originally available from CelesTrak, a publicly available source for satellite orbital information.
CelesTrak publishes satellite position data in TLE (Two-Line Element) format, which describes a satellite’s orbit using two compact lines of text. While TLE is the standard format used by satellite agencies, it is difficult to interpret directly for beginners. So this dataset provides a cleaned and structured CSV version that is easier to use with Python and data science libraries.
Each file in the dataset corresponds to a specific Starlink satellite and contains its orbital data over a range of dates (usually 1 month). Each row is a snapshot of the satellite's position and movement at a given timestamp.
Key columns include:
| Column Name | Description |
|---|---|
| Satellite_Name | Unique identifier for each Starlink satellite. Example: STARLINK-1008. |
| Epoch | The timestamp (in UTC) representing the exact moment when the satellite's orbital data was recorded. |
| Inclination_deg | Angle between the satellite’s orbital plane and Earth’s equator. 0° means equatorial orbit; 90° means polar orbit. |
| Eccentricity | Describes the shape of the orbit. 0 = perfect circle; values approaching 1 = highly elliptical. |
| Mean_Motion_orbits_per_day | Number of orbits the satellite completes around Earth in a single day. |
| Altitude_km | Satellite’s altitude above Earth’s surface in kilometers, calculated from orbital parameters. |
| Latitude | Satellite’s geographic latitude at the recorded time. Positive = Northern Hemisphere, Negative = Southern Hemisphere. |
| Longitude | Satellite’s geographic longitude at the recorded time. Positive = East of Prime Meridian, Negative = West. |
TLE is a compact format used in aerospace and satellite communications, but it is difficult for beginners to interpret directly and is not immediately usable with common data analysis tools.
That’s why this dataset presents the same orbital data but in a clean and normalized CSV structure ready for analysis and machine learning.
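A minimal sketch for loading one of the per-satellite files and plotting altitude over time (the filename is illustrative; the column names come from the table above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("STARLINK-1008.csv", parse_dates=["Epoch"])  # hypothetical per-satellite file
df = df.sort_values("Epoch")
plt.plot(df["Epoch"], df["Altitude_km"])
plt.xlabel("Epoch (UTC)")
plt.ylabel("Altitude (km)")
plt.title(df["Satellite_Name"].iloc[0])
plt.show()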
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The archive contains simulator data in CSV format and Python routines enabling their post-processing and plotting. A "README" file explains how to use these routines. These data were recorded during the final EFAICTS project evaluations and used in a publication also available on Zenodo: 10.5281/zenodo.6796534
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains Python function and class snippets extracted from multiple public repositories. Each snippet is labeled as clean (0) or buggy (1). It is intended for training machine learning models for automated bug detection, code quality analysis, and code classification tasks.
JSON file (dataset.json) containing all code snippets.
CSV file (dataset.csv) formatted for Kaggle, with columns:
code: Python snippet
label: 0 = clean, 1 = buggy
Train ML models for code bug detection.
Experiment with static analysis, code classification, or NLP models on code.
Benchmark code analysis tools or AI assistants.
Python code from multiple public repositories was parsed to extract function and class snippets.
Each snippet was executed to determine if it raises an exception (buggy) or runs cleanly.
Additional buggy variants were generated automatically by introducing common code errors (wrong operator, division by zero, missing import, variable renaming).
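A minimal sketch of that labeling step (illustrative only, not the actual pipeline; executing arbitrary snippets should only ever be done in a sandboxed environment):

def label_snippet(code: str) -> int:
    # Returns 0 for clean, 1 for buggy, following the dataset's convention
    try:
        exec(compile(code, "<snippet>", "exec"), {})
        return 0
    except Exception:
        return 1

print(label_snippet("x = 1 / 0"))   # 1 (buggy: division by zero)
print(label_snippet("x = 1 + 1"))   # 0 (clean)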
~XXX snippets (you can replace with actual number)
Balanced between clean and buggy code
CC0 1.0 Universal (Public Domain) – Free to use for research and commercial purposes.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Formula 1 Comprehensive Dataset (2020-2025)
Dataset Description
This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.
Perfect for:
📊 F1 performance analysis
🤖 Machine learning projects
📈 Data visualization
🏆 Championship predictions
📋 Racing statistics research
📁 Files Included
1. f1_race_results_2020_2025.csv (53 entries): Race winners and results from Grand Prix weekends
Date, Grand Prix name, race winner
Constructor, nationality, grid position
Race time, fastest lap time, points scored
Q1, Q2, Q3 session times
Grid positions, laps completed
Driver and constructor information
Points accumulation over race weekends
Wins, podiums, pole positions tracking
Season-long championship battle data
Constructor points and wins
Team performance metrics
Manufacturer rivalry data
Track length, number of turns
Lap records and record holders
Circuit designers and first F1 usage
Career wins, poles, podiums
Racing entries and achievements
Active and retired driver records
Multiple data types in one file
Ready for immediate analysis
Comprehensive F1 information hub
🔧 Data Features: Clean & Structured. All data professionally formatted.
Dataset Title: Motor Trend Car Road Tests (mtcars)
Description: The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a classic, foundational dataset used extensively in statistics and data science for learning exploratory data analysis, regression modeling, and hypothesis testing.
This dataset is a staple in the R programming language (?mtcars) and is now provided here in a clean CSV format for easy access in Python, Excel, and other data analysis environments.
Acknowledgements: This dataset was originally compiled and made available by the journal Motor Trend in 1974. It has been bundled with the R statistical programming language for decades, serving as an invaluable resource for learners and practitioners alike.
Data Dictionary: Each row represents a different car model. The columns (variables) are as follows:
| Column Name | Data Type | Description |
|---|---|---|
| model | object (String) | The name and model of the car. |
| mpg | float | Miles/(US) gallon. A measure of fuel efficiency. |
| cyl | integer | Number of cylinders (4, 6, 8). |
| disp | float | Displacement (cubic inches). Engine size. |
| hp | integer | Gross horsepower. Engine power. |
| drat | float | Rear axle ratio. Affects torque and fuel economy. |
| wt | float | Weight (1000 lbs). Vehicle mass. |
| qsec | float | 1/4 mile time (seconds). A measure of acceleration. |
| vs | binary | Engine shape (0 = V-shaped, 1 = Straight). |
| am | binary | Transmission (0 = Automatic, 1 = Manual). |
| gear | integer | Number of forward gears (3, 4, 5). |
| carb | integer | Number of carburetors (1, 2, 3, 4, 6, 8). |
Key Questions & Potential Use Cases: This dataset is perfect for exploring relationships between a car's specifications and its performance. Some classic analysis questions include:
Fuel Efficiency: What factors are most predictive of a car's miles per gallon (mpg)? Is it engine size (disp), weight (wt), or horsepower (hp)?
Performance: How does transmission type (am) affect acceleration (qsec) and fuel economy (mpg)? Do manual cars perform better?
Classification: Can we accurately predict the number of cylinders (cyl) or the type of engine (vs) based on other car features?
Clustering: Are there natural groupings of cars (e.g., performance cars, economy cars) based on their specifications?
Inspiration: This is one of the most famous datasets in statistics. You can find thousands of examples, tutorials, and analyses using it online. It's an excellent starting point for:
Practicing multiple linear regression and correlation analysis.
Building your first EDA (Exploratory Data Analysis) notebook.
Learning about feature engineering and model interpretation.
Comparing statistical results from R and Python (e.g., statsmodels vs scikit-learn).
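A minimal regression sketch in Python along these lines (column names from the data dictionary above; the filename matches the file listed under File Details below):

import pandas as pd
import statsmodels.api as sm

cars = pd.read_csv("mtcars-parquet.csv")
# Predict fuel efficiency (mpg) from weight and horsepower
X = sm.add_constant(cars[["wt", "hp"]])
model = sm.OLS(cars["mpg"], X).fit()
print(model.summary())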
File Details: mtcars-parquet.csv: The main dataset file in CSV format.
Number of instances (rows): 32
Number of attributes (columns): 12
Missing Values? No, this is a complete dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Cryptocurrency trading analysis and algorithmic strategy development rely on high-quality, high-frequency historical data. This dataset provides clean, structured OHLCV data for one of the most liquid and popular trading pairs, ETH/USDT, sourced directly from the Bybit exchange. It is ideal for quantitative analysts, data scientists, and trading enthusiasts looking to backtest strategies, perform market analysis, or build predictive models across different time horizons.
The dataset consists of three separate CSV files, each corresponding to a different time frame:
BYBIT_ETHUSDT_15m.csv: Historical data in 15-minute intervals. BYBIT_ETHUSDT_1h.csv: Historical data in 1-hour intervals. BYBIT_ETHUSDT_4h.csv: Historical data in 4-hour intervals.
Each file contains the same six columns:
This dataset is made possible by the publicly available data from the Bybit exchange. Please consider this when using the data for your projects.
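A minimal sketch for loading one of the files, computing simple returns, and aggregating hourly candles into daily bars (the column names are assumed to be standard OHLCV labels; check the actual headers):

import pandas as pd

eth = pd.read_csv("BYBIT_ETHUSDT_1h.csv")
# Assumed column names; adjust to the actual headers in the file
eth["timestamp"] = pd.to_datetime(eth["timestamp"])
eth = eth.set_index("timestamp").sort_index()
eth["return"] = eth["close"].pct_change()
# Aggregate hourly candles into daily OHLCV bars
daily = eth.resample("1D").agg({"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"})
print(daily.tail())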
About the dataset (cleaned data)
The dataset (parquet file) contains approximately 1.5 million residential household sales from Denmark during the period from 1992 to 2024. All cleaned data is merged into one parquet file here on Kaggle. Note that some cleaning might still be necessary; see the notebook under Code.
Also, added a random sample (100k) of the dataset as a csv file.
Done in Python version: 2.6.3.
Raw data
Raw data and more info are available in the GitHub repository: https://github.com/MartinSamFred/Danish-residential-housingPrices-1992-2024.git
The dataset has been scraped and cleaned (to some extent). Cleaned files are located in \Housing_data_cleaned\ and named DKHousingprices_1 and 2, saved in parquet format (as two files due to size).
Cleaning from the raw files to the above cleaned files is outlined in BoligsalgConcatCleanigGit.ipynb (done in Python version: 2.6.3).
Webscraping script: Webscrape_script.ipynb (done in Python version: 2.6.3)
Provided you want to clean raw files from scratch yourself:
Uncleaned scraped files (81 in total) are located in \Housing_data_raw \ Housing_data_batch1 and 2. Saved in .csv format and compressed as 7-zip files.
Additional files added/appended to the Cleaned files are located in \Addtional_data and named DK_inflation_rates, DK_interest_rates, DK_morgage_rates and DK_regions_zip_codes. Saved in .xlsx format.
Content
Each row in the dataset contains a residential household sale during the period 1992 - 2024.
“Cleaned files” columns:
0 'date': is the transaction date
1 'quarter': is the quarter based on a standard calendar year
2 'house_id': unique house id (could be dropped)
3 'house_type': can be 'Villa', 'Farm', 'Summerhouse', 'Apartment', 'Townhouse'
4 'sales_type': can be 'regular_sale', 'family_sale', 'other_sale', 'auction', '-' (“-“ could be dropped)
5 'year_build': range 1000 to 2024 (could be narrowed more)
6 'purchase_price': is purchase price in DKK
7 '%_change_between_offer_and_purchase': could differ negatively, be zero or positive
8 'no_rooms': number of rooms
9 'sqm': number of square meters
10 'sqm_price': 'purchase_price' divided by 'sqm'
11 'address': is the address
12 'zip_code': is the zip code
13 'city': is the city
14 'area': 'East & mid jutland', 'North jutland', 'Other islands', 'Capital, Copenhagen', 'South jutland', 'North Zealand', 'Fyn & islands', 'Bornholm'
15 'region': 'Jutland', 'Zealand', 'Fyn & islands', 'Bornholm'
16 'nom_interest_rate%': Danish nominal interest rate shown per quarter; however, the actual rate is not converted from annualized to quarterly
17 'dk_ann_infl_rate%': Danish annual inflation rate shown per quarter; however, the actual rate is not converted from annualized to quarterly
18 'yield_on_mortgage_credit_bonds%': 30 year mortgage bond rate (without spread)
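A minimal sketch for loading the parquet file and inspecting a few of these columns (the filename is illustrative; pyarrow or fastparquet is required by pandas for parquet support):

import pandas as pd

sales = pd.read_parquet("DKHousingprices_1.parquet")  # filename/extension may differ on Kaggle
sales["date"] = pd.to_datetime(sales["date"])
# Median square-meter price per year for regular sales
regular = sales[sales["sales_type"] == "regular_sale"].copy()
print(regular.groupby(regular["date"].dt.year)["sqm_price"].median())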
Uses
Various (statistical) analysis, visualisation and I assume machine learning as well.
Practice exercises etc.
Uncleaned scraped files are great for practicing cleaning, especially string cleaning. I'm not an expert, as seen in the coding ;-).
Disclaimer
The data and information in the dataset provided here are intended to be used primarily for educational purposes only. I do not own any data, and all rights are reserved to the respective owners as outlined in "Acknowledgements/sources". The accuracy of the dataset is not guaranteed; accordingly, any analysis and/or conclusions are solely the user's own responsibility and accountability.
Acknowledgements/sources
All data is publicly available on:
Boliga: https://www.boliga.dk/
Finans Danmark: https://finansdanmark.dk/
Danmarks Statistik: https://www.dst.dk/da
Statistikbanken: https://statistikbanken.dk/statbank5a/default.asp?w=2560
Macrotrends: https://www.macrotrends.net/
PostNord: https://www.postnord.dk/
World Data: https://www.worlddata.info/
Dataset picture / cover photo: Nick Karvounis (https://unsplash.com/)
Have fun… :-)