This Python program simulates an automatic vacuum cleaner in a room using a dataset. The vacuum cleaner detects dirt and obstacles, cleans the dirt, and avoids obstacles. The program reads the room layout from a CSV file, processes each cell to check for dirt or obstacles, and updates the room status accordingly.
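A minimal sketch of the idea, assuming a hypothetical room_layout.csv in which 'D' marks dirt, 'O' marks an obstacle, and '.' marks clean floor (the actual dataset encoding may differ):

import csv

# Assumed cell encoding: 'D' = dirt, 'O' = obstacle, '.' = clean floor.
def clean_room(layout_path):
    with open(layout_path, newline="") as f:
        room = [row for row in csv.reader(f)]
    for r, row in enumerate(room):
        for c, cell in enumerate(row):
            if cell == "O":
                continue            # avoid obstacles
            if cell == "D":
                room[r][c] = "."    # clean the dirt
    return room

if __name__ == "__main__":
    for row in clean_room("room_layout.csv"):   # hypothetical filename
        print(" ".join(row))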
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
MIT License: https://opensource.org/licenses/MIT
This dataset contains Starlink satellite data in both CSV and TLE formats. At the top level, it includes four files: one set representing a snapshot of all Starlink satellites at a specific time and another set representing a time-range dataset for STARLINK-1008 from March 11 to April 10, 2025. Additionally, there is a folder named STARLINK_INDIVIDUAL_SATELLITE_CSV_TLE_FILES_WITH_TIME_RANGE, which contains per-satellite data files in both CSV and TLE formats. These cover the time range from January 1, 2024, to June 6, 2025, for individual satellites. The number of files varies as satellites may have been launched at different times within this period.
This dataset contains processed CSV versions of Starlink satellite data originally available from CelesTrak, a publicly available source for satellite orbital information.
CelesTrak publishes satellite position data in TLE (Two-Line Element) format, which describes a satellite’s orbit using two compact lines of text. While TLE is the standard format used by satellite agencies, it is difficult to interpret directly for beginners. So this dataset provides a cleaned and structured CSV version that is easier to use with Python and data science libraries.
Each file in the dataset corresponds to a specific Starlink satellite and contains its orbital data over a range of dates (usually 1 month). Each row is a snapshot of the satellite's position and movement at a given timestamp.
Key columns include:
| Column Name | Description |
| --- | --- |
| Satellite_Name | Unique identifier for each Starlink satellite. Example: STARLINK-1008. |
| Epoch | The timestamp (in UTC) representing the exact moment when the satellite's orbital data was recorded. |
| Inclination_deg | Angle between the satellite's orbital plane and Earth's equator. 0° means equatorial orbit; 90° means polar orbit. |
| Eccentricity | Describes the shape of the orbit. 0 = perfect circle; values approaching 1 = highly elliptical. |
| Mean_Motion_orbits_per_day | Number of orbits the satellite completes around Earth in a single day. |
| Altitude_km | Satellite's altitude above Earth's surface in kilometers, calculated from orbital parameters. |
| Latitude | Satellite's geographic latitude at the recorded time. Positive = Northern Hemisphere, negative = Southern Hemisphere. |
| Longitude | Satellite's geographic longitude at the recorded time. Positive = east of the Prime Meridian, negative = west. |
TLE is a compact format used in aerospace and satellite communications, but it is difficult to interpret directly without specialized tooling. That's why this dataset presents the same orbital data in a clean, normalized CSV structure that is ready for analysis and machine learning.
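As a quick-start sketch, assuming pandas is installed (the per-satellite filename below is hypothetical; the column names are taken from the table above):

import pandas as pd

# Hypothetical filename; actual per-satellite files follow the naming used in the dataset folder.
df = pd.read_csv("STARLINK-1008.csv", parse_dates=["Epoch"])

# Average altitude and mean motion over the covered period.
print(df["Altitude_km"].mean())
print(df["Mean_Motion_orbits_per_day"].mean())

# Ground-track coordinates at each epoch.
print(df[["Epoch", "Latitude", "Longitude"]].head())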
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are included as well.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
# data_file is the path to one of the downsampled CSV files
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
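For example, a minimal sketch, assuming one of the downsampled files has been saved locally (the filename below is hypothetical), that extracts the true strain and true stress columns named above:

import pandas

# Hypothetical filename for one downsampled record.
data = pandas.read_csv("LP_example.csv", index_col=0)

# "e_true" and "Sigma_true" hold the true strain and true stress histories.
e_true = data["e_true"].to_numpy()
sigma_true = data["Sigma_true"].to_numpy()
print(len(e_true), "points; peak true stress:", sigma_true.max())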
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version identify the specific summary file
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                   keep_default_na=False, na_values='')
Caveats
About the dataset (cleaned data)
The dataset (a parquet file) contains approximately 1.5 million residential household sales from Denmark during the period from 1992 to 2024. All cleaned data is merged into one parquet file here on Kaggle. Note that some cleaning might still be necessary; see the notebook under Code.
A random sample (100k rows) of the dataset has also been added as a CSV file.
Done in Python version: 2.6.3.
Raw data
Raw data and more info are available in the GitHub repository: https://github.com/MartinSamFred/Danish-residential-housingPrices-1992-2024.git
The dataset has been scraped and cleaned (to some extent). The cleaned files are located in \Housing_data_cleaned\ and named DKHousingprices_1 and 2. They are saved in parquet format (as two files due to size).
Cleaning from the raw files to the cleaned files above is outlined in BoligsalgConcatCleanigGit.ipynb (done in Python version: 2.6.3).
Webscraping script: Webscrape_script.ipynb (done in Python version: 2.6.3)
If you want to clean the raw files from scratch yourself:
The uncleaned scraped files (81 in total) are located in \Housing_data_raw\ under Housing_data_batch1 and 2. They are saved in .csv format and compressed as 7-zip files.
Additional files added/appended to the cleaned files are located in \Addtional_data and named DK_inflation_rates, DK_interest_rates, DK_morgage_rates and DK_regions_zip_codes. They are saved in .xlsx format.
Content
Each row in the dataset contains a residential household sale during the period 1992 - 2024.
“Cleaned files” columns:
0 'date': is the transaction date
1 'quarter': is the quarter based on a standard calendar year
2 'house_id': unique house id (could be dropped)
3 'house_type': can be 'Villa', 'Farm', 'Summerhouse', 'Apartment', 'Townhouse'
4 'sales_type': can be 'regular_sale', 'family_sale', 'other_sale', 'auction', '-' (“-“ could be dropped)
5 'year_build': range 1000 to 2024 (could be narrowed more)
6 'purchase_price': is purchase price in DKK
7 '%_change_between_offer_and_purchase': can be negative, zero or positive
8 'no_rooms': number of rooms
9 'sqm': number of square meters
10 'sqm_price': 'purchase_price' divided by 'sqm'
11 'address': is the address
12 'zip_code': is the zip code
13 'city': is the city
14 'area': 'East & mid jutland', 'North jutland', 'Other islands', 'Capital, Copenhagen', 'South jutland', 'North Zealand', 'Fyn & islands', 'Bornholm'
15 'region': 'Jutland', 'Zealand', 'Fyn & islands', 'Bornholm'
16 'nom_interest_rate%': Danish nominal interest rate, shown per quarter; note the rate is annualized and has not been converted to a quarterly rate
17 'dk_ann_infl_rate%': Danish annual inflation rate, shown per quarter; note the rate is annualized and has not been converted to a quarterly rate
18 'yield_on_mortgage_credit_bonds%': 30 year mortgage bond rate (without spread)
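A minimal sketch for getting started, assuming pandas with a parquet engine is installed (the filename below is hypothetical and should match the file you download; the column names come from the list above):

import pandas as pd

# Hypothetical filename; use the parquet file name as provided on Kaggle.
df = pd.read_parquet("DKHousingprices_1.parquet")

# Median price per square meter by region.
print(df.groupby("region")["sqm_price"].median())

# Sales per house type over the full 1992-2024 period.
print(df["house_type"].value_counts())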
Uses
Various statistical analyses, visualisation and, I assume, machine learning as well.
Practice exercises etc.
The uncleaned scraped files are great for practising data cleaning, especially string cleaning. I'm not an expert, as you can see from the coding ;-).
Disclaimer
The data and information in the dataset provided here are intended primarily for educational purposes. I do not own any of the data, and all rights are reserved to the respective owners as outlined in "Acknowledgements/sources". The accuracy of the dataset is not guaranteed; accordingly, any analysis and/or conclusions are solely the user's own responsibility and accountability.
Acknowledgements/sources
All data is publicly available on:
Boliga: https://www.boliga.dk/
Finans Danmark: https://finansdanmark.dk/
Danmarks Statistik: https://www.dst.dk/da
Statistikbanken: https://statistikbanken.dk/statbank5a/default.asp?w=2560
Macrotrends: https://www.macrotrends.net/
PostNord: https://www.postnord.dk/
World Data: https://www.worlddata.info/
Dataset picture / cover photo: Nick Karvounis (https://unsplash.com/)
Have fun… :-)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

The full set of files, in order of use, is as follows:
Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
HTML.zip: Archived version of the set of individual HTML files.
03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
categorization_applied.csv: Output file containing the classification of selected pages.
exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
TM_125.RData: RData file containing the results of the 125-topic model.
loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
This dataset is a refined version of Alpine 1.0. It was created by generating tasks using various LLMs, wrapping them in special elements {Instruction Start} ... {Instruction End}, and saving them in a text file. We then processed this file with a Python script that used regex to extract the tasks into a CSV. Afterward, we cleaned the dataset by removing near-duplicates, vague prompts, and ambiguous entries.
python clean.py -i prompts.csv -o cleaned.csv -p "prompt" -t 0.92 -l 30
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed.
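As an illustration of the extraction step (a minimal sketch, not the authors' actual script; the raw text filename is hypothetical, and the output name follows the command shown above):

import csv
import re

# Tasks are wrapped in {Instruction Start} ... {Instruction End} in the raw text file.
PATTERN = re.compile(r"\{Instruction Start\}(.*?)\{Instruction End\}", re.DOTALL)

with open("prompts.txt", encoding="utf-8") as f:      # hypothetical raw dump
    raw = f.read()

tasks = [m.strip() for m in PATTERN.findall(raw)]

with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])
    writer.writerows([t] for t in tasks)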
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Formula 1 Comprehensive Dataset (2020-2025)
Dataset Description
This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.
Perfect for:
📊 F1 performance analysis
🤖 Machine learning projects
📈 Data visualization
🏆 Championship predictions
📋 Racing statistics research
📁 Files Included
1. f1_race_results_2020_2025.csv (53 entries): Race winners and results from Grand Prix weekends
Date, Grand Prix name, race winner
Constructor, nationality, grid position
Race time, fastest lap time, points scored
Q1, Q2, Q3 session times
Grid positions, laps completed
Driver and constructor information
Points accumulation over race weekends
Wins, podiums, pole positions tracking
Season-long championship battle data
Constructor points and wins
Team performance metrics
Manufacturer rivalry data
Track length, number of turns
Lap records and record holders
Circuit designers and first F1 usage
Career wins, poles, podiums
Racing entries and achievements
Active and retired driver records
Multiple data types in one file
Ready for immediate analysis
Comprehensive F1 information hub
🔧 Data Features
Clean & Structured: All data professionally formatted
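A quick-look sketch for the race results file, assuming pandas (the column names are paraphrased from the bullets above and may differ slightly in the actual CSV, so the columns are inspected first):

import pandas as pd

results = pd.read_csv("f1_race_results_2020_2025.csv")

# Inspect the available columns before relying on exact names.
print(results.columns.tolist())
print(results.head())

# Example: wins per driver across 2020-2025 (assumes a "Winner" column exists).
if "Winner" in results.columns:
    print(results["Winner"].value_counts().head(10))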
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset and code package support the reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. Including:
The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
Value of the Data:
* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
Data Description:
* /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
File Formats:
* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj
Ethics & Licensing
* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and MIT License (code).
Limitations
* Labels reflect annotator interpretations and may encode bias.
* Models trained on English text; generalization to other languages requires adaptation.
Funding Note
* Funding sources provided time in support of human taggers annotating the data sets.
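As a starting point, a minimal loading sketch assuming pandas (only the /data/tagged/ layout and the existence of data_dictionary.csv are taken from the description above; concatenating all tagged files into one frame is an assumption):

import glob
import pandas as pd

# Load every human-labeled CSV under /data/tagged/ into one frame.
# (Column semantics are documented in data_dictionary.csv; its exact location
# within the package is an assumption here.)
paths = sorted(glob.glob("data/tagged/*.csv"))
tagged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(tagged.shape)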
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area, so they were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document. Data were synthesized into comma-separated value files: produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.
MIT License: https://opensource.org/licenses/MIT
See details from OpenAI: https://openai.com/index/introducing-swe-bench-verified/
Converted from Parquet to CSV from https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
Data Summary from Huggingface:
SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.
The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Want to run inference now? This dataset only contains the problem_statement (i.e. issue text) and the base_commit which represents the state of the codebase before the issue has been resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets.
princeton-nlp/SWE-bench_Lite_oracle
princeton-nlp/SWE-bench_Lite_bm25_13K
princeton-nlp/SWE-bench_Lite_bm25_27K
Supported Tasks and Leaderboards
SWE-bench proposes a new task: issue resolution provided a full repository and GitHub issue. The leaderboard can be found at www.swebench.com
Languages
The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.
Dataset Structure
An example of a SWE-bench datum is as follows:
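As a hedged sketch of working with the CSV conversion (pandas assumed; the problem_statement and base_commit fields are named in this description, while the filename below is hypothetical):

import pandas as pd

# Hypothetical filename for the CSV conversion of SWE-bench Verified.
df = pd.read_csv("swe_bench_verified.csv")

# Each row pairs an issue with the pre-fix state of the repository.
row = df.iloc[0]
print(row["problem_statement"][:200])   # issue text
print(row["base_commit"])               # codebase state before the fix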
MIT License: https://opensource.org/licenses/MIT
Abstract
The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. This article therefore proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs).
Paper | GitHub Repository containing the scripts needed to build the dataset from scratch (if needed)
Project structure
This repository contains 8 Numpy files, one for each featured language, pickled with Python 3.6. Each line corresponds to the spectrogram of the file mentioned in the file verses.csv. There is a direct mapping between the ID of the verse and its index in the list (thus the verse with ID 5634 is located at index 5634 in the Numpy file). Verses not available for a given language (as stated by the value "Not Available" in the CSV file) are represented by empty lists in the Numpy files, thus ensuring a perfect verse-to-verse alignment between the files.
Spectrograms were extracted using Librosa with the following parameters:
Pre-emphasis = 0.97
Sample rate = 16000
Window size = 0.025
Window stride = 0.01
Window type = 'hamming'
Mel coefficients = 40
Min frequency = 20
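A minimal loading sketch under stated assumptions (the Numpy filename below is hypothetical, and the files are assumed to load with allow_pickle=True since they were pickled):

import numpy as np

# Hypothetical filename for one language's file; see the repository for actual names.
spectrograms = np.load("english.npy", allow_pickle=True)

verse_id = 5634                       # verse IDs map directly to list indices
spec = spectrograms[verse_id]
if len(spec) == 0:
    print("Verse not available for this language")
else:
    print("Spectrogram shape:", np.asarray(spec).shape)   # 40 mel coefficients x frames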
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
The repository contains data on party strength for each state, as shown on each state's corresponding party strength Wikipedia page (for example, Virginia).
Each state has a Wikipedia table giving a detailed summary of its governing and representing bodies, but there is no dataset that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in congress
4. electoral votes
Data in the clean version has been cleaned and processed substantially. Namely:
- all columns now contain homogeneous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been left
The notebook that created this file is here
The data contained herein have not been altered from their Wikipedia tables except in two instances:
- forced column names to be in accord across states
- any needed data modifications (i.e., concatenated string columns) to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not "utf-8", though in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
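For example, a minimal loading sketch (the pickle filename is hypothetical; the cleaned CSV name and the ISO-8859-1 encoding come from the notes above):

import pickle
import pandas as pd

# Cleaned, collated table; note the non-UTF-8 encoding mentioned above.
cleaned = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")
print(cleaned.head())

# Raw scraped tables: a dict mapping state name -> pandas DataFrame.
with open("state_party_strength_raw.pickle", "rb") as f:   # hypothetical filename
    raw_tables = pickle.load(f)
print(raw_tables["Virginia"].head())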
Hope it proves as useful to you in analyzing/using political patterns at the state level in the US for political and policy research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The archive contains simulator data in CSV format and Python routines enabling their post-processing and plotting. A "README" file explains how to use these routines. These data were recorded during the final EFAICTS project evaluations and used in a publication that is also available on Zenodo: 10.5281/zenodo.6796534