14 datasets found
  1. automatic dirt detection vacuum cleaner

    • kaggle.com
    Updated Jun 28, 2024
    Cite
    bisma nawal (2024). automatic dirt detection vacuum cleaner [Dataset]. https://www.kaggle.com/datasets/bismanawal/automatic-dirt-detection-vacuum-cleaner
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Kaggle
    Authors
    bisma nawal
    Description

This Python program simulates an automatic vacuum cleaner in a room using a dataset. The vacuum cleaner detects dirt and obstacles, cleans the dirt, and avoids the obstacles. The program reads the room layout from a CSV file, processes each cell to check for dirt or obstacles, and updates the room status accordingly.
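A minimal sketch of the simulation loop described above (the CSV layout and cell markers are assumptions, since they are not documented here):

import csv

# Hypothetical room layout: "D" = dirt, "O" = obstacle, "." = clean floor.
with open("room_layout.csv", newline="") as f:
    room = [list(row) for row in csv.reader(f)]

for i, row in enumerate(room):
    for j, cell in enumerate(row):
        if cell == "O":
            continue          # avoid obstacles
        if cell == "D":
            room[i][j] = "."  # clean the dirt and update the room status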

  2. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • figshare.com
    • dataverse.harvard.edu
    Updated Aug 1, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.6084/m9.figshare.20416092.v1
    Explore at:
    Dataset updated
    Aug 1, 2022
    Dataset provided by
Figshare: http://figshare.com/
    Authors
    Elizabeth Szkirpan
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  3. Starlink Satellite TLE/CSV dataset (April 2025)

    • kaggle.com
    Updated Jun 9, 2025
    Cite
    Vijay J0shi (2025). Starlink Satellite TLE/CSV dataset (April 2025) [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/starlink-satellite-tlecsv-dataset-april-2025
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2025
    Dataset provided by
Kaggle: http://kaggle.com/
    Authors
    Vijay J0shi
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CelesTrak Starlink TLE Data (CSV Format)

    About This Dataset

    This dataset contains Starlink satellite data in both CSV and TLE formats. At the top level, it includes four files: one set representing a snapshot of all Starlink satellites at a specific time and another set representing a time-range dataset for STARLINK-1008 from March 11 to April 10, 2025. Additionally, there is a folder named STARLINK_INDIVIDUAL_SATELLITE_CSV_TLE_FILES_WITH_TIME_RANGE, which contains per-satellite data files in both CSV and TLE formats. These cover the time range from January 1, 2024, to June 6, 2025, for individual satellites. The number of files varies as satellites may have been launched at different times within this period.

    This dataset contains processed CSV versions of Starlink satellite data originally available from CelesTrak, a publicly available source for satellite orbital information.

    CelesTrak publishes satellite position data in TLE (Two-Line Element) format, which describes a satellite’s orbit using two compact lines of text. While TLE is the standard format used by satellite agencies, it is difficult to interpret directly for beginners. So this dataset provides a cleaned and structured CSV version that is easier to use with Python and data science libraries.

    What's Inside

    Each file in the dataset corresponds to a specific Starlink satellite and contains its orbital data over a range of dates (usually 1 month). Each row is a snapshot of the satellite's position and movement at a given timestamp.

    Key columns include:

• Satellite_Name: unique identifier for each Starlink satellite (e.g., STARLINK-1008).
    • Epoch: the timestamp (in UTC) representing the exact moment when the satellite's orbital data was recorded.
    • Inclination_deg: angle between the satellite's orbital plane and Earth's equator; 0° means an equatorial orbit, 90° a polar orbit.
    • Eccentricity: describes the shape of the orbit; 0 = perfect circle, values approaching 1 = highly elliptical.
    • Mean_Motion_orbits_per_day: number of orbits the satellite completes around Earth in a single day.
    • Altitude_km: satellite's altitude above Earth's surface in kilometers, calculated from orbital parameters.
    • Latitude: satellite's geographic latitude at the recorded time; positive = Northern Hemisphere, negative = Southern Hemisphere.
    • Longitude: satellite's geographic longitude at the recorded time; positive = east of the Prime Meridian, negative = west.
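For example, one of the per-satellite CSV files can be loaded with pandas and summarised directly (a minimal sketch; the file name is illustrative and the column names are assumed to match the list above):

import pandas as pd

# Hypothetical per-satellite file from the time-range folder.
df = pd.read_csv("STARLINK-1008.csv", parse_dates=["Epoch"])

# Daily mean altitude over the covered period.
daily_altitude = df.set_index("Epoch")["Altitude_km"].resample("D").mean()
print(daily_altitude.head())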

    Why CSV?

    TLE is a compact format used in aerospace and satellite communications, but:

    • It is not beginner-friendly.
    • It requires a dedicated parser.
    • It’s difficult to visualize or analyze directly.

    That’s why this dataset presents the same orbital data but in a clean and normalized CSV structure ready for analysis and machine learning.

    Use Cases

    • Satellite orbit visualization.
    • Time-series analysis of Starlink constellations.
    • Anomaly detection (e.g., using autoencoders or clustering).
    • Feature engineering for orbit-based models.
    • Educational projects for learning satellite mechanics.

    Data Source

    • Source: CelesTrak Starlink TLE Feed
    • Converted to CSV using custom Python scripts.
    • Time range: Typically one month per file (can vary).
  4. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
bin, zip, csv: available download formats
    Dataset updated
    Dec 24, 2022
    Dataset provided by
Zenodo: http://zenodo.org/
    Authors
Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
• We recommend that you unzip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

import pandas
    # data_file is the path to a single downsampled test .csv file
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data
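As a minimal sketch using only the file and column names given above, the overall summary can be loaded and grouped with pandas:

import pandas as pd

# Specimen-level summary file shipped with the database.
summary = pd.read_csv("Overall_Summary_2022-08-25_v1-0-0.csv", skipinitialspace=True)

# Example: number of specimens per material grade and load protocol.
print(summary.groupby(["grade", "lp"]).size())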

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

import pandas as pd
    # date and version match the file name, e.g. date = '2022-08-25_', version = 'v1-0-0'
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  5. Danish Residential Housing Prices 1992-2024

    • kaggle.com
    Updated Nov 29, 2024
    Cite
    Martin Frederiksen (2024). Danish Residential Housing Prices 1992-2024 [Dataset]. https://www.kaggle.com/datasets/martinfrederiksen/danish-residential-housing-prices-1992-2024
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2024
    Dataset provided by
Kaggle: http://kaggle.com/
    Authors
    Martin Frederiksen
    Description

    Danish residential house prices (1992-2024)

    About the dataset (cleaned data)

The dataset (a parquet file) contains approximately 1.5 million residential household sales in Denmark during the period from 1992 to 2024. All cleaned data is merged into one parquet file here on Kaggle. Note that some cleaning might still be necessary; see the notebook under Code.

    A random sample (100k rows) of the dataset is also included as a CSV file.

    Done in Python version: 2.6.3.

    Raw data

Raw data and more info are available in the GitHub repository: https://github.com/MartinSamFred/Danish-residential-housingPrices-1992-2024.git

    The dataset has been scraped and cleaned (to some extent). Cleaned files are located in \Housing_data_cleaned\ and named DKHousingprices_1 and DKHousingprices_2, saved in parquet format (split into two files due to size).

    Cleaning from the raw files to the above cleaned files is outlined in BoligsalgConcatCleanigGit.ipynb (done in Python version: 2.6.3).

    Webscraping script: Webscrape_script.ipynb (done in Python version: 2.6.3)

    If you want to clean the raw files from scratch yourself:

    Uncleaned scraped files (81 in total) are located in \Housing_data_raw\ under Housing_data_batch1 and 2, saved in .csv format and compressed as 7-zip files.

    Additional files added/appended to the cleaned files are located in \Addtional_data and named DK_inflation_rates, DK_interest_rates, DK_morgage_rates and DK_regions_zip_codes, saved in .xlsx format.

    Content

    Each row in the dataset contains a residential household sale during the period 1992 - 2024.

    “Cleaned files” columns:

    0 'date': is the transaction date

    1 'quarter': is the quarter based on a standard calendar year

    2 'house_id': unique house id (could be dropped)

    3 'house_type': can be 'Villa', 'Farm', 'Summerhouse', 'Apartment', 'Townhouse'

    4 'sales_type': can be 'regular_sale', 'family_sale', 'other_sale', 'auction', '-' (“-“ could be dropped)

    5 'year_build': range 1000 to 2024 (could be narrowed more)

    6 'purchase_price': is purchase price in DKK

    7 '%_change_between_offer_and_purchase': could differ negatively, be zero or positive

    8 'no_rooms': number of rooms

    9 'sqm': number of square meters

10 'sqm_price': 'purchase_price' divided by 'sqm'

    11 'address': is the address

    12 'zip_code': is the zip code

    13 'city': is the city

    14 'area': 'East & mid jutland', 'North jutland', 'Other islands', 'Capital, Copenhagen', 'South jutland', 'North Zealand', 'Fyn & islands', 'Bornholm'

    15 'region': 'Jutland', 'Zealand', 'Fyn & islands', 'Bornholm'

16 'nom_interest_rate%': Danish nominal interest rate, shown per quarter; note that the actual rate is not converted from annualized to quarterly

    17 'dk_ann_infl_rate%': Danish annual inflation rate, shown per quarter; note that the actual rate is not converted from annualized to quarterly

    18 'yield_on_mortgage_credit_bonds%': 30 year mortgage bond rate (without spread)
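As an illustration, the merged data can be read with pandas (a minimal sketch; the parquet file name on Kaggle is assumed, so adjust the path to what the dataset page shows):

import pandas as pd

# Hypothetical path to the merged parquet file (requires pyarrow or fastparquet).
df = pd.read_parquet("DKHousingprices.parquet")

# Example: median price per square meter by region and year.
df["year"] = pd.to_datetime(df["date"]).dt.year
print(df.groupby(["region", "year"])["sqm_price"].median().head(20))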

    Uses

Various statistical analyses, visualisation and, I assume, machine learning as well.

    Practice exercises etc.

Uncleaned scraped files are great for practicing cleaning, especially string cleaning. I'm not an expert, as seen in the coding ;-).

    Disclaimer

The data and information in the dataset provided here are intended to be used primarily for educational purposes. I do not own any of the data, and all rights are reserved to the respective owners as outlined in "Acknowledgements/sources". The accuracy of the dataset is not guaranteed; accordingly, any analysis and/or conclusions are solely the user's own responsibility and accountability.

    Acknowledgements/sources

    All data is publicly available on:

    Boliga: https://www.boliga.dk/

    Finans Danmark: https://finansdanmark.dk/

    Danmarks Statistik: https://www.dst.dk/da

    Statistikbanken: https://statistikbanken.dk/statbank5a/default.asp?w=2560

    Macrotrends: https://www.macrotrends.net/

    PostNord: https://www.postnord.dk/

    World Data: https://www.worlddata.info/

    Dataset picture / cover photo: Nick Karvounis (https://unsplash.com/)

    Have fun… :-)

6. CompuCrawl: Full database and code

    • dataverse.nl
    Updated Sep 23, 2025
    Cite
    Richard Haans; Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
    Explore at:
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    DataverseNL
    Authors
    Richard Haans; Richard Haans
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

    The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering. The full set of files, in order of use, is as follows:

    • Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
    • 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
    • URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
    • 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
    • scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
    • HTML.zip: Archived version of the set of individual HTML files.
    • 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
    • TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
    • input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
    • 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
    • categorization_applied.csv: Output file containing the classification of selected pages.
    • exclusion_list.xlsx: File containing three sheets: 'gvkeys' with the GVKEYs of duplicate observations (that need to be excluded), 'pages' with page IDs for pages that should be removed, and 'sentences' with (sub-)sentences to be removed.
    • 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
    • metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
    • TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word-embeddings application.
    • TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
    • 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
    • TM_125.RData: RData file containing the results of the 125-topic model.
    • loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
    • 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
    • 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
    • Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
    • 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
    • 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
    • URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

    For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

    The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens, Organizational Research Methods. The full paper can be accessed here.
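For a quick look at the topic-model output, the loadings file can be read with pandas (a minimal sketch; the exact column layout of loadings125.csv is not documented above):

import pandas as pd

# Topic loadings for all GVKEY/year observations included in the 125-topic model.
loadings = pd.read_csv("loadings125.csv")
print(loadings.shape)
print(loadings.head())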

7. alpine1.1-multireq-instructions-seed

    • huggingface.co
    Cite
    Marcus Cedric R. Idia, alpine1.1-multireq-instructions-seed [Dataset]. https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed
    Explore at:
    Authors
    Marcus Cedric R. Idia
    Description

This dataset is a refined version of Alpine 1.0. It was created by generating tasks with various LLMs, wrapping them in special markers {Instruction Start} ... {Instruction End}, and saving them in a text file. We then processed this file with a Python script that used a regex to extract the tasks into a CSV. Afterward, we cleaned the dataset by removing near-duplicates, vague prompts, and ambiguous entries:

    python clean.py -i prompts.csv -o cleaned.csv -p "prompt" -t 0.92 -l 30

    This dataset… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed.
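A regex-based extraction along the lines described above might look like this (a hedged sketch, not the authors' clean.py; the input file name is illustrative, and the output column name follows the -p "prompt" flag shown above):

import csv
import re

# Tasks were wrapped in {Instruction Start} ... {Instruction End} markers in a text file.
pattern = re.compile(r"\{Instruction Start\}(.*?)\{Instruction End\}", re.DOTALL)

with open("prompts.txt", encoding="utf-8") as f:
    tasks = [m.strip() for m in pattern.findall(f.read())]

with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])
    writer.writerows([[task] for task in tasks])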

  8. Comprehensive Formula 1 Dataset (2020-2025)

    • kaggle.com
    Updated Jul 27, 2025
    Cite
    V SHREE KAMALESH (2025). Comprehensive Formula 1 Dataset (2020-2025) [Dataset]. https://www.kaggle.com/datasets/vshreekamalesh/comprehensive-formula-1-dataset-2020-2025
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2025
    Dataset provided by
Kaggle: http://kaggle.com/
    Authors
    V SHREE KAMALESH
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Formula 1 Comprehensive Dataset (2020-2025)

Dataset Description

    This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.

    Perfect for:

    📊 F1 performance analysis

    🤖 Machine learning projects

    📈 Data visualization

    🏆 Championship predictions

    📋 Racing statistics research

📁 Files Included

    1. f1_race_results_2020_2025.csv (53 entries): race winners and results from Grand Prix weekends
    • Date, Grand Prix name, race winner
    • Constructor, nationality, grid position
    • Race time, fastest lap time, points scored

    2. f1_qualifying_results_2020_2024.csv (820 entries): qualifying session results with timing data
    • Q1, Q2, Q3 session times
    • Grid positions, laps completed
    • Driver and constructor information

    3. f1_driver_standings_progressive.csv (600 entries): championship standings progression throughout seasons
    • Points accumulation over race weekends
    • Wins, podiums, pole positions tracking
    • Season-long championship battle data

    4. f1_constructor_standings_progressive.csv (360 entries): team championship standings evolution
    • Constructor points and wins
    • Team performance metrics
    • Manufacturer rivalry data

    5. f1_circuits_technical_data.csv (24 entries): technical specifications for all F1 circuits
    • Track length, number of turns
    • Lap records and record holders
    • Circuit designers and first F1 usage

    6. f1_historical_driver_statistics.csv (30 entries): all-time career statistics for F1 drivers
    • Career wins, poles, podiums
    • Racing entries and achievements
    • Active and retired driver records

    7. f1_comprehensive_dataset_2020_2025.csv (432 entries): MAIN DATASET, combining data from all sources
    • Multiple data types in one file
    • Ready for immediate analysis
    • Comprehensive F1 information hub

    🔧 Data Features

    Clean & Structured: All data professionally format
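As a quick start, the main combined file can be loaded with pandas (a minimal sketch; the column layout is not documented above, so it is only inspected here):

import pandas as pd

# Main combined dataset listed above (432 entries).
f1 = pd.read_csv("f1_comprehensive_dataset_2020_2025.csv")
print(f1.shape)
print(f1.columns.tolist())  # inspect the available columns before any analysis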

9. ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured...

    • data.mendeley.com
    Updated Aug 15, 2025
    Cite
    Christopher Lynch (2025). ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods [Dataset]. http://doi.org/10.17632/g2sdzmssgh.1
    Explore at:
    Dataset updated
    Aug 15, 2025
    Authors
    Christopher Lynch
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. It includes:

    • Tagged datasets (.csv): human-tagged gold labels for evaluation
    • Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative
      • Suitable for inference, semi-automatic labeling, or transfer learning
    • Python and R code for preprocessing, model training, evaluation, and visualization
    • Configuration files and environment specifications to enable end-to-end reproducibility

    The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

Value of the Data:
    • Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
    • Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
    • Offers untagged datasets for new annotation or domain adaptation.
    • Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
    • Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

Data Description:
    • /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
    • /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
    • /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
    • /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
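A minimal sketch (assuming only the folder layout above; the schema lives in data_dictionary.csv) of loading all tagged files with pandas:

import glob
import pandas as pd

# Gather every human-tagged CSV under /data/tagged/.
tagged_paths = sorted(glob.glob("data/tagged/*.csv"))
tagged = pd.concat((pd.read_csv(p) for p in tagged_paths), ignore_index=True)
print(f"{len(tagged_paths)} files, {len(tagged)} rows")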

File Formats:
    • Data: CSV (UTF-8, RFC 4180)
    • Code: .py, .R, .Rproj

    Ethics & Licensing:
    • All data are de-identified and contain no PII.
    • Released under CC BY 4.0 (data) and MIT License (code).

    Limitations:
    • Labels reflect annotator interpretations and may encode bias.
    • Models trained on English text; generalization to other languages requires adaptation.

    Funding Note:
    • Funding sources provided time in support of human taggers annotating the data sets.

10. Data from: Data to Estimate Water Use Associated with Oil and Gas...

    • catalog.data.gov
    • data.usgs.gov
• +1 more
    Updated Oct 1, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la
    Explore at:
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    New Mexico
    Description

The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value (CSV) files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area, so they were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document.

    Data were synthesized into comma-separated value files: produced_water.csv (volume) from NM OCD; well_records.csv (including location and completion) from NM OCD and FracFocus; and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Python scripts to process, clean, and categorize the FracFocus data are also provided in this data release.
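A short sketch of reading the three CSV files named above with pandas (the columns are not documented here, so they are only inspected):

import pandas as pd

# The three comma-separated value files included in this data release.
frames = {name: pd.read_csv(f"{name}.csv")
          for name in ("produced_water", "well_records", "modeled_estimates")}

for name, df in frames.items():
    print(name, df.shape)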

  11. SWE Bench Verified

    • kaggle.com
    • huggingface.co
    Updated Aug 20, 2024
    Cite
    Harry Wang (2024). SWE Bench Verified [Dataset]. https://www.kaggle.com/datasets/harrywang/swe-bench-verified
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Harry Wang
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    See details from OpenAI: https://openai.com/index/introducing-swe-bench-verified/

    Converted from Parquet to CSV from https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

    Data Summary from Huggingface:

    SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.

    The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

    The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Want to run inference now? This dataset only contains the problem_statement (i.e. issue text) and the base_commit which represents the state of the codebase before the issue has been resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets.

    princeton-nlp/SWE-bench_Lite_oracle

    princeton-nlp/SWE-bench_Lite_bm25_13K

    princeton-nlp/SWE-bench_Lite_bm25_27K

Supported Tasks and Leaderboards

    SWE-bench proposes a new task: issue resolution, provided a full repository and GitHub issue. The leaderboard can be found at www.swebench.com.

Languages

    The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.

    Dataset Structure

    An example of a SWE-bench datum is as follows:

    • instance_id: (str) - A formatted instance identifier, usually as repo_owner_repo_name-PR-number.
    • patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
    • repo: (str) - The repository owner/name identifier from GitHub.
    • base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
• hints_text: (str) - Comments made on the issue prior to the creation date of the solution PR's first commit.
    • created_at: (str) - The creation date of the pull request.
• test_patch: (str) - A test-file patch that was contributed by the solution PR.
    • problem_statement: (str) - The issue title and body.
    • version: (str) - Installation version to use for running evaluation.
    • environment_setup_commit: (str) - commit hash to use for environment setup and installation.
    • FAIL_TO_PASS: (str) - A json list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
    • PASS_TO_PASS: (str) - A json list of strings that represent tests that should pass before and after the PR application.
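Because this Kaggle copy is a CSV conversion, the JSON-encoded list fields need an explicit parse; a minimal sketch (the CSV file name inside the Kaggle dataset is assumed):

import json
import pandas as pd

# Hypothetical file name for the CSV conversion of SWE-bench Verified.
df = pd.read_csv("swe_bench_verified.csv")

# FAIL_TO_PASS and PASS_TO_PASS are JSON lists of test identifiers stored as strings.
df["FAIL_TO_PASS"] = df["FAIL_TO_PASS"].apply(json.loads)
df["PASS_TO_PASS"] = df["PASS_TO_PASS"].apply(json.loads)
print(df.loc[0, "instance_id"], len(df.loc[0, "FAIL_TO_PASS"]), "fail-to-pass tests")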
12. MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

    • live.european-language-grid.eu
    • zenodo.org
    npy
    Updated Aug 28, 2022
    Cite
    (2022). MaSS - Multilingual corpus of Sentence-aligned Spoken utterances [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7722
    Explore at:
npy: available download formats
    Dataset updated
    Aug 28, 2022
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

Abstract

    The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. This article therefore proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs).

    Paper | GitHub repository containing the scripts needed to build the dataset from scratch (if needed)

    Project structure

    This repository contains 8 Numpy files, one for each featured language, pickled with Python 3.6. Each line corresponds to the spectrogram of the file mentioned in the file verses.csv. There is a direct mapping between the ID of the verse and its index in the list (thus the verse with ID 5634 is located at index 5634 in the Numpy file). Verses not available for a given language (as stated by the value "Not Available" in the CSV file) are represented by empty lists in the Numpy files, thus ensuring a perfect verse-to-verse alignment between the files.

    Spectrograms were extracted using Librosa with the following parameters:

    • Pre-emphasis = 0.97
    • Sample rate = 16000
    • Window size = 0.025
    • Window stride = 0.01
    • Window type = 'hamming'
    • Mel coefficients = 40
    • Min frequency = 20
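A minimal loading sketch based on the description above (the per-language file name is hypothetical; verse 5634 is the example ID used above):

import numpy as np

# Each per-language Numpy file is a pickled list of spectrograms indexed by verse ID.
spectrograms = np.load("english.npy", allow_pickle=True)  # hypothetical file name
spec = spectrograms[5634]
if len(spec) == 0:
    print("Verse 5634 is not available for this language")
else:
    print("Spectrogram shape:", np.asarray(spec).shape)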

  13. Party strength in each US state

    • kaggle.com
    Updated Jan 13, 2017
    Cite
    GeneBurin (2017). Party strength in each US state [Dataset]. https://www.kaggle.com/datasets/kiwiphrases/partystrengthbystate
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2017
    Dataset provided by
    Kaggle
    Authors
    GeneBurin
    License

CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Data on party strength in each US state

The repository contains data on party strength for each state, as shown on each state's corresponding party-strength Wikipedia page (for example, Virginia).

Each state's Wikipedia page has a table giving a detailed summary of the state of its governing and representative bodies, but there is no dataset that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.

    Data contents:

The data contain information from 1980 onward on each state's:
    1. governor and party
    2. state house and senate composition
    3. state representative composition in Congress
    4. electoral votes

    Clean Version

Data in the clean version has been cleaned and processed substantially. Namely:
    • all columns now contain homogeneous data within the column
    • names and Wiki-citations have been removed
    • only the party counts and party identification have been kept

    The notebook that created this file is here.

    Uncleaned Data Version

The data contained herein have not been altered from their Wikipedia tables except in two instances:
    • column names were forced to be consistent across states
    • any needed data modifications (i.e., concatenated string columns) were made to retain information when combining columns

    To use the data:

Please note that the correct encoding for the dataset is "ISO-8859-1", not 'utf-8'; in future versions I will try to fix that to make it more accessible.

This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook used to create this data file is located here.
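In practice, that means passing the encoding explicitly when loading the CSV files named above (a minimal sketch; whether the cleaned file needs the same encoding is not stated, so it is passed for both):

import pandas as pd

# The dataset ships as "ISO-8859-1" rather than UTF-8.
raw = pd.read_csv("state_party_strength.csv", encoding="ISO-8859-1")
clean = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")
print(raw.shape, clean.shape)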

    Raw scraped data

    The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.

Hope it proves useful to you in analyzing political patterns at the state level in the US for political and policy research.

  14. Simulator data

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Simulator data [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7057250?locale=bg
    Explore at:
unknown (266936983): available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
Zenodo: http://zenodo.org/
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The archive contains simulator data in CSV format and Python routines enabling their post-processing and plotting. A README file explains how to use these routines. These data were recorded during the final EFAICTS project evaluations and used in a publication that is also available on Zenodo: 10.5281/zenodo.6796534
