38 datasets found
  1. Natural Questions Dataset

    • kaggle.com
    zip
    Updated Mar 15, 2024
    Cite
    fujoos (2024). Natural Questions Dataset [Dataset]. https://www.kaggle.com/datasets/frankossai/natural-questions-dataset
    Explore at:
    zip (116502047 bytes)
    Dataset updated
    Mar 15, 2024
    Authors
    fujoos
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Context

    The Natural Questions (NQ) dataset is a comprehensive collection of real user queries submitted to Google Search, with answers sourced from Wikipedia by expert annotators. Created by Google AI Research, this dataset aims to support the development and evaluation of advanced automated question-answering systems. The version provided here includes 89,312 meticulously annotated entries, tailored for ease of access and utility in natural language processing (NLP) and machine learning (ML) research.

    Data Collection

    The dataset is composed of authentic search queries from Google Search, reflecting the wide range of information sought by users globally. This approach ensures a realistic and diverse set of questions for NLP applications.

    Data Pre-processing

    The NQ dataset underwent significant pre-processing to prepare it for NLP tasks:

    • Removal of web-specific elements like URLs, hashtags, user mentions, and special characters using Python's "BeautifulSoup" and "regex" libraries.
    • Grammatical error identification and correction using the "LanguageTool" library, an open-source grammar, style, and spell checker.

    These steps were taken to clean and simplify the text while retaining the essence of the questions and their answers, divided into 'questions', 'long answers', and 'short answers'.
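
    As an illustration of the kind of cleaning described above (a minimal sketch, not the authors' actual script), web-specific elements could be stripped from an answer string with BeautifulSoup and regex as follows; the grammar-correction step with LanguageTool is omitted, and the function name is hypothetical:

    import re
    from bs4 import BeautifulSoup

    def clean_answer(raw_html: str) -> str:
        """Strip HTML tags, URLs, hashtags, user mentions, and stray special characters."""
        text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
        text = re.sub(r"https?://\S+", " ", text)                # URLs
        text = re.sub(r"[#@]\w+", " ", text)                     # hashtags and user mentions
        text = re.sub(r"[^A-Za-z0-9.,;:'\"()\-\s]", " ", text)   # remaining special characters
        return re.sub(r"\s+", " ", text).strip()

    print(clean_answer("<p>See https://example.com #wiki @user for details!</p>"))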

    Data Storage

    The unprocessed data, including answers with embedded HTML, empty or complex long and short answers, is stored in "Natural-Questions-Base.csv". This version retains the raw structure of the data, featuring HTML elements in answers, and varied answer formats such as tables and lists, providing a comprehensive view for those interested in the original dataset's complexity and richness. The processed data is compiled into a single CSV file named "Natural-Questions-Filtered.csv". The file is structured for easy access and analysis, with each record containing the processed question, a detailed answer, and concise answer snippets.

    Filtered Results

    The filtered version is available where specific criteria, such as question length or answer complexity, were applied to refine the data further. This version allows for more focused research and application development.

    Flask CSV Reader App

    The repository at 'https://github.com/fujoos/natural_questions' also includes a Flask-based CSV reader application designed to read and display contents from the "NaturalQuestions.csv" file. The app provides functionalities such as:

    • Viewing questions and answers directly in your browser.
    • Filtering results based on criteria like question keywords or answer length.

    A live demo, using the CSV files converted to a SQLite database, is available at 'https://fujoos.pythonanywhere.com/'.

  2. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  3. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and the data from the individual experiments are included as well.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend that you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_"-prefixed .csv files in the Clean_Data archive; each contains the downsampled data for one test. The columns are:

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas

    # index_col=0 uses the unnamed first column (the original sample index) as the DataFrame index
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_"-prefixed .csv files in the Unreduced_Data archives; each contains the original (unreduced) data for one test. The columns are:

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data
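
    As an illustration only (not part of the dataset documentation), the overall summary can be used to locate and load a specimen's clean stress-strain file via its "file" column; the relative path layout below is an assumption and should be checked against the unzipped folders:

    import pandas as pd

    summary = pd.read_csv("Overall_Summary_2022-08-25_v1-0-0.csv")
    first = summary.iloc[0]
    # The "file" column names the clean (downsampled) data file; prepend the Clean_Data folder if needed
    data = pd.read_csv("Clean_Data/" + first["file"], index_col=0)
    print(first["grade"], data[["e_true", "Sigma_true"]].head())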

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd

    # Four index columns and a two-level column header (property / statistic)
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The specimens in the following directories were tested before the protocol was established. Therefore, only the true stress-strain data is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  4. FC Barcelona Champions League 24/25 Stats

    • kaggle.com
    zip
    Updated May 7, 2025
    Cite
    Ali Ratel (2025). FC Barcelona Champions League 24/25 Stats [Dataset]. https://www.kaggle.com/datasets/aliratel01/fc-barcelona-champions-league-2425-stats
    Explore at:
    zip (18078 bytes)
    Dataset updated
    May 7, 2025
    Authors
    Ali Ratel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description


    Barcelona Champions League 2024–2025 Stats

    This dataset captures FC Barcelona's journey in the 2024–2025 UEFA Champions League, with detailed statistics scraped from FBref.com. It includes comprehensive match and player-level data covering all major performance areas such as passing, shooting, defending, goalkeeping, and more.

    The data was collected using Python and Playwright, and organized into clean CSV files for easy analysis. Github Scraping Code

    📁 Dataset Content

    The dataset was collected by scraping FBref’s publicly available tables using Python and Playwright. The following tables were extracted:

    • Standard_Stats_2024-2025_Barcelona_Champions_League.csv: Basic stats per player (games, goals, assists, etc.)
    • Scores_and_Fixtures_2024-2025_Barcelona_Champions_League.csv: Match results, dates, formations, etc.
    • Goalkeeping_2024-2025_Barcelona_Champions_League.csv: Goals against, saves, wins, losses, etc.
    • Advanced_Goalkeeping_2024-2025_Barcelona_Champions_League.csv: Goals against, post-shot expected goals, throws attempted, etc.
    • Shooting_2024-2025_Barcelona_Champions_League.csv: Shot types, goals, penalty kicks, etc.
    • Passing_2024-2025_Barcelona_Champions_League.csv: Total passes, pass distance, key passes, assists, expected assists, etc.
    • Pass_Types_2024-2025_Barcelona_Champions_League.csv: Pass types, crosses, switches, etc.
    • Goal_and_Shot_Creation_2024-2025_Barcelona_Champions_League.csv: Shot-creating actions, goal-creating actions, etc.
    • Defensive_Actions_2024-2025_Barcelona_Champions_League.csv: Tackles, dribbles, etc.
    • Possession_2024-2025_Barcelona_Champions_League.csv: Ball touches, carries, take-ons, etc.
    • Playing_Time_2024-2025_Barcelona_Champions_League.csv: Minutes, starts, substitutions, etc.
    • Miscellaneous_Stats_2024-2025_Barcelona_Champions_League.csv: Fouls, cards, offsides, aerials won/lost, etc.
    • League_phase,_Champions_League.csv: Champions League group/phase info

    📌 Note: This dataset is not fully cleaned. It contains missing values (NaN). In addition, multiple tables need to be merged to get a complete picture of each player's performance. This makes the dataset a great opportunity for beginners to practice data cleaning, handling missing data, and combining related datasets for analysis.
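
    As a starting point (illustrative only), two of the player-level tables can be merged with pandas; the shared "Player" column name is an assumption based on typical FBref tables and should be verified against the actual headers:

    import pandas as pd

    standard = pd.read_csv("Standard_Stats_2024-2025_Barcelona_Champions_League.csv")
    shooting = pd.read_csv("Shooting_2024-2025_Barcelona_Champions_League.csv")
    # Suffixes disambiguate any overlapping stat columns from the two tables
    merged = standard.merge(shooting, on="Player", how="inner", suffixes=("_std", "_shoot"))
    print(merged.head())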

  5. Data Extraction from Vint Marketplace

    • crawlfeeds.com
    csv, zip
    Updated Dec 31, 2024
    Cite
    Crawl Feeds (2024). Data Extraction from Vint Marketplace [Dataset]. https://crawlfeeds.com/datasets/data-extraction-from-vint-marketplace-comprehensive-csv-dataset-with-20k-records
    Explore at:
    csv, zip
    Dataset updated
    Dec 31, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Looking for reliable and actionable data from the Vint Marketplace? Our expertly extracted dataset is just what you need. With over 20,000 records in CSV format, this dataset is tailored to meet the needs of analysts, researchers, and businesses looking to gain valuable insights into the thriving marketplace for fine wines and spirits.

    What’s Included in the Vint Marketplace Dataset?

    • Comprehensive Data Points: Detailed records covering product names, vintages, regions, pricing, and more.
    • Clean CSV Format: Optimized for easy import into tools like Excel, Python, or Power BI for seamless analysis.
    • Updated and Accurate: Freshly sourced from Vint Marketplace to ensure the most relevant and up-to-date information.

    Benefits of Using the Vint Marketplace CSV Dataset

    1. Streamlined Analysis: Easily identify trends in wine pricing, regional popularity, and investment-grade bottles.
    2. Time-Saving: Skip manual data collection with a pre-extracted dataset ready for use.
    3. Versatility: Ideal for building predictive models, crafting detailed market reports, or expanding product catalogs.

    Why Choose Our Dataset?

    We understand the value of quality data in driving decisions. This 20k-record CSV dataset is meticulously compiled to provide structured and accessible information for your specific requirements. Whether you're conducting market research or building an e-commerce platform, this dataset offers the granular detail you need.

    Get Started Today

    Unlock the potential of fine wine data with our Vint Marketplace CSV dataset. With its organized format and extensive records, it’s the perfect resource to elevate your projects. Contact us now to access the dataset and take the next step in data-driven decision-making.

  6. CompuCrawl: Full database and code

    • dataverse.nl
    Updated Sep 23, 2025
    Cite
    Richard Haans (2025). CompuCrawl: Full database and code [Dataset]. http://doi.org/10.34894/OBVAOY
    Explore at:
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    DataverseNL
    Authors
    Richard Haans
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

    The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

    The full set of files, in order of use, is as follows:

    • Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
    • 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
    • URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
    • 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
    • scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
    • HTML.zip: Archived version of the set of individual HTML files.
    • 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
    • TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
    • input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
    • 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
    • categorization_applied.csv: Output file containing the classification of selected pages.
    • exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
    • 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
    • metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
    • TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
    • TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
    • 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
    • TM_125.RData: RData file containing the results of the 125-topic model.
    • loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
    • 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
    • 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
    • Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
    • 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
    • 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
    • URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

    For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

    The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
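
    For those working with the combined texts directly, a minimal sketch (not part of the release) of iterating over "TXT_combined.zip" with Python's standard zipfile module might look like this; the internal file naming is an assumption and should be checked against the archive:

    import zipfile

    with zipfile.ZipFile("TXT_combined.zip") as zf:
        names = [n for n in zf.namelist() if n.endswith(".txt")]
        print(len(names), "combined firm/year documents")
        # Peek at the first document (assumed to be plain UTF-8 text named by GVKEY/year)
        sample = zf.read(names[0]).decode("utf-8", errors="ignore")
        print(sample[:500])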

  7. Countries by population 2021 (Worldometer)

    • kaggle.com
    zip
    Updated Aug 16, 2021
    Cite
    Artem Zapara (2021). Countries by population 2021 (Worldometer) [Dataset]. https://www.kaggle.com/datasets/artemzapara/countries-by-population-2021-worldometer
    Explore at:
    zip (8163 bytes)
    Dataset updated
    Aug 16, 2021
    Authors
    Artem Zapara
    Description

    Context

    This dataset is a clean CSV file with the most recent estimates of the population of countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/

    Content

    The data was generated by web scraping the aforementioned link on 16 August 2021. Below is the code used to produce the CSV data in Python 3.8:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://www.worldometers.info/world-population/population-by-country/"
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    countries = soup.find_all("table")[0]
    dataframe = pd.read_html(str(countries))[0]
    dataframe.to_csv("countries_by_population_2021.csv", index=False)

    Acknowledgements

    The creation of this dataset would not have been possible without the team at Worldometers, a data aggregation website.

  8. alpine1.1-multireq-instructions-seed

    • huggingface.co
    Cite
    Marcus Cedric R. Idia, alpine1.1-multireq-instructions-seed [Dataset]. https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed
    Explore at:
    Authors
    Marcus Cedric R. Idia
    Description

    This dataset is a refined version of Alpine 1.0. It was created by generating tasks using various LLMs, wrapping them in special elements {Instruction Start} ... {Instruction End}, and saving them in a text file. We then processed this file with a Python script that used regex to extract the tasks into a CSV. Afterward, we cleaned the dataset by removing near-duplicates, vague prompts, and ambiguous entries:

    python clean.py -i prompts.csv -o cleaned.csv -p "prompt" -t 0.92 -l 30

    This dataset… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed.
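
    A minimal sketch of the regex extraction step described above (not the authors' actual script); the input file name is hypothetical, while the output file and "prompt" column follow the clean.py invocation:

    import csv
    import re

    raw = open("tasks.txt", encoding="utf-8").read()  # hypothetical raw LLM output file
    # Non-greedy match of everything between the two delimiter tokens
    tasks = re.findall(r"\{Instruction Start\}(.*?)\{Instruction End\}", raw, flags=re.DOTALL)

    with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt"])
        writer.writerows([[t.strip()] for t in tasks])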

  9. Data from: Global spatially explicit crop water consumption shows an overall...

    • zenodo.org
    bin, zip
    Updated Oct 6, 2025
    + more versions
    Cite
    Abebe Chukalla; Mesfin Mekonnen; Dahami Gunathilake; Fitsume Teshome Wolkeba; Bhawani Gunasekara; Davy Vanham (2025). Global spatially explicit crop water consumption shows an overall increase of 9% for 46 agricultural crops from 2010 to 2020: Data and software [Dataset]. http://doi.org/10.5281/zenodo.17059989
    Explore at:
    zip, bin
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abebe Chukalla; Mesfin Mekonnen; Dahami Gunathilake; Fitsume Teshome Wolkeba; Bhawani Gunasekara; Davy Vanham
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset comprises spatial and temporal data related to our analysis on blue and green water consumption (WC) of global crop production in high spatial resolution (5 arc-minutes – approximately 10 km at the equator) for the years 2020, 2010 and 2000.

    Modelling water consumption of SPAM data

    We use SPAM (Spatial Production Allocation Model) data, released by the International Food Policy Research Institute (IFPRI). We use SPAM2020 data for the year 2020 (46 crops), SPAM2010 data for the year 2010 (42 crops) and SPAM2000 data for the year 2000 (20 crops).

    We develop a Python-based global gridded crop green and blue WC assessment tool, entitled CropGBWater. Operating on a daily time scale, CropGBWater dynamically simulates the rootzone water balance and related fluxes. We provide this model open access as Data_S10.

    SPAM2020 crop data are modelled for the years 2018-2022, SPAM2010 crop data for the years 2008-2012 and SPAM2000 crop data for the years 1998-2002. We compute WCbl (blue WC) and WCgn (green WC), with components WCgn,irr (green WC of irrigated area) and WCgn,rf (green WC of rainfed area).

    File description:

    The data-set consists of the following files:

    • Data_S4: Data_S4_Y2020_WC_m3_gridded.zip
      Folder with 46 individual crop grid files (5arc min resolution, with x & y coordinates), monthly and annual WCbl, WCgn,irr and WCgn,rf values in m3 in csv format, year 2020. Individual crop GIS-Rasters for annual m3 amounts are provided as Data_S17
    • Data_S5: Data_S5_YR2020_WC_mm_gridded_csv
      Folder with 46 individual crop grid files (5arc min resolution, with x & y coordinates), monthly and annual WCbl, WCgn,irr and WCgn, rf in mm as well as SPAM harvested area values in csv format, year 2020. Individual crop GIS-Rasters for annual mm amounts are provided as Data_S18
    • Data_S6: Data_S6_YR2020_WC_gridded_individual-crops-m3_annual.xlsx
      One grid file (5arc min resolution, with x & y coordinates) with annual WCbl, WCgn,irr and WCgn, rf values in m3, differentiating between individual crops, year 2020.
    • Data_S7: Data_S7_YR2020_WC_gridded_sum-of-crops-m3_monthly-annual.csv
      One grid file (5arc min resolution, with x & y coordinates) with monthly and annual WCbl, WCgn,irr and WCgn, rf values in m3, for the sum of all crops, year 2020
    • Data_S8: Data_S8_YR2000_WC_mm_m3_gridded.zip
      Grid (5arc min resolution, with x & y coordinates) with annual WCbl, WCgn,irr and WCgn, rf values in mm and m3, as well as SPAM harvested area amounts, for each crop, year 2000
    • Data_S9: Data_S9_YR2010_WC_mm_m3_gridded.zip
      Grid (5arc min resolution, with x & y coordinates) with annual WCbl, WCgn,irr and WCgn, rf values in mm and m3, as well as SPAM harvested area amounts, for each crop, year 2010
    • Data_S10: Data_S10_CropGBWater_v02_1c-clean.ipynb Python-based global gridded crop green and blue WC assessment tool, entitled CropGBWater
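
    As an illustration only (not part of the dataset documentation), a gridded csv such as Data_S7 could be inspected and aggregated with pandas; the water-consumption column name used in the commented example is an assumption and must be checked against the actual header:

    import pandas as pd

    df = pd.read_csv("Data_S7_YR2020_WC_gridded_sum-of-crops-m3_monthly-annual.csv")
    print(df.columns.tolist())   # inspect the real column names first
    # Example aggregation, assuming an annual blue WC column named "WCbl_annual_m3":
    # total_blue_km3 = df["WCbl_annual_m3"].sum() / 1e9
    # print(f"Global blue WC of the included crops, 2020: {total_blue_km3:.1f} km3")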

    Please only use the latest version of this zenodo repository

    Publication:

    For all details, please refer to the open access paper:

    Chukalla, A.D., Mekonnen, M.M., Gunathilake, D., Wolkeba, F.T., Gunasekara, B., Vanham, D. (2025) Global spatially explicit crop water consumption shows an overall increase of 9% for 46 agricultural crops from 2010 to 2020, Nature Food, Volume 6, https://doi.org/10.1038/s43016-025-01231-x

    Funding:

    This research, led by IWMI, a CGIAR centre, was carried out under the CGIAR Initiative on Foresight (www.cgiar.org/initiative/foresight/) as well as the CGIAR “Policy innovations” Science Program (www.cgiar.org/cgiar-research-porfolio-2025-2030/policy-innovations). The authors would like to thank all funders who supported this research through their contributions to the CGIAR Trust Fund (www.cgiar.org/funders).

  10. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated May 20, 2024
    Cite
    Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
    Explore at:
    zip (1235 bytes)
    Dataset updated
    May 20, 2024
    Authors
    Martin Kanju
    Description

    Dataset

    This dataset was created by Martin Kanju

    Released under Other (specified in description)

    Contents

  11. Data from: Data to Estimate Water Use Associated with Oil and Gas...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    New Mexico
    Description

    The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document. Data were synthesized into comma-separated value files, which include produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.
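
    A minimal sketch (illustrative only) of loading the three csv files named in this release; their columns are not listed in this description, so inspect the headers before joining or aggregating:

    import pandas as pd

    produced_water = pd.read_csv("produced_water.csv")
    well_records = pd.read_csv("well_records.csv")
    modeled_estimates = pd.read_csv("modeled_estimates.csv")
    for name, df in [("produced_water", produced_water),
                     ("well_records", well_records),
                     ("modeled_estimates", modeled_estimates)]:
        print(name, df.shape, list(df.columns))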

  12. Cleaned Contoso Dataset

    • kaggle.com
    zip
    Updated Aug 27, 2023
    Cite
    Bhanu (2023). Cleaned Contoso Dataset [Dataset]. https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset
    Explore at:
    zip (487695063 bytes)
    Dataset updated
    Aug 27, 2023
    Authors
    Bhanu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. A Jupyter Notebook containing the code used to clean the data can be found here.

    Version 6 includes some additional cleaning and structuring whose need was noticed after importing the data into Power BI. Changes were made by adding code to the Python notebook to export a new cleaned dataset, such as adding a MonthNumber column for sorting by month number, and similarly a WeekDayNumber column.

    Cleaning was done in Python, with SQL Server also used to quickly inspect the data. Headers were added separately, ensuring no data loss. The data was cleaned for NaN values, garbage values, and other problem columns.
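
    A hypothetical sketch of the sorting-helper columns described above (the input file and "Date" column names are assumptions, not actual dataset identifiers):

    import pandas as pd

    df = pd.read_csv("dates.csv", parse_dates=["Date"])   # hypothetical date table
    df["MonthNumber"] = df["Date"].dt.month               # 1-12, lets Power BI sort month names correctly
    df["WeekDayNumber"] = df["Date"].dt.weekday           # 0 = Monday ... 6 = Sunday
    df.to_csv("dates_cleaned.csv", index=False)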

  13. Fraudulent Financial Transaction Prediction

    • kaggle.com
    zip
    Updated Feb 15, 2025
    Cite
    Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
    Explore at:
    zip (41695207 bytes)
    Dataset updated
    Feb 15, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Fraud Detection with Imbalanced Data

    Overview
    This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

    Key Points
    - Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
    - Multiple Feature Files: Combine them by matching on id or Group.
    - Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
    - Goal: Predict which transactions in test_share.csv might be fraudulent.

    Files in this Dataset

    1. train.csv

      • Rows: 227,845 (example size)
      • Columns: 28
      • Description: Contains historical transaction data for training a fraud detection model.
      • Important: The Target column (0 = Clean, 1 = Fraud).
    2. test_share.csv

      • Rows: 56,962 (example size)
      • Columns: 27
      • Description: Test dataset, with the same structure as train.csv but without the Target column.
    3. Geo_scores.csv

      • Columns: (id, geo_score)
      • Description: Location-based geospatial scores for each transaction.
    4. Lambda_wts.csv

      • Columns: (Group, lambda_wt)
      • Description: Proprietary “lambda” weights associated with each Group.
    5. Qset_tats.csv

      • Columns: (id, qsets_normalized_tat)
      • Description: Network turn-around times (TAT) for each transaction.
    6. instance_scores.csv

      • Columns: (id, instance_scores)
      • Description: Vulnerability or risk qualification scores for each transaction.

    Suggested Usage

    1. Load all CSVs into dataframes.
    2. Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.
    3. Explore the severe class imbalance in train.csv (Target ~1% is fraud).
    4. Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.
    5. Predict on test_share.csv or your own external data.

    Possible Tools:
    - Python: pandas, NumPy, scikit-learn
    - Imbalance Handling: SMOTE, Random Oversampler, or class weights
    - Metrics: Precision, Recall, F1-score, ROC-AUC, etc.

    Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
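
    A minimal merging sketch based on the files and join keys described above (assuming train.csv carries both the id and Group columns); any modelling beyond the merge is left out:

    import pandas as pd

    train = pd.read_csv("train.csv")
    geo = pd.read_csv("Geo_scores.csv")
    lam = pd.read_csv("Lambda_wts.csv")
    tat = pd.read_csv("Qset_tats.csv")
    inst = pd.read_csv("instance_scores.csv")

    df = (train.merge(geo, on="id", how="left")
               .merge(tat, on="id", how="left")
               .merge(inst, on="id", how="left")
               .merge(lam, on="Group", how="left"))
    print(df["Target"].mean())   # fraction of fraudulent transactions (severe class imbalance)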

    Tags

    • fraud-detection
    • classification
    • imbalanced-data
    • financial-transactions
    • machine-learning
    • python
    • beginner-friendly

    License: CC BY-NC-SA 4.0

  14. Starlink Satellite TLE/CSV dataset (April 2025)

    • kaggle.com
    Updated Jun 9, 2025
    Cite
    Vijay J0shi (2025). Starlink Satellite TLE/CSV dataset (April 2025) [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/starlink-satellite-tlecsv-dataset-april-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vijay J0shi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CelesTrak Starlink TLE Data (CSV Format)

    About This Dataset

    This dataset contains Starlink satellite data in both CSV and TLE formats. At the top level, it includes four files: one set representing a snapshot of all Starlink satellites at a specific time and another set representing a time-range dataset for STARLINK-1008 from March 11 to April 10, 2025. Additionally, there is a folder named STARLINK_INDIVIDUAL_SATELLITE_CSV_TLE_FILES_WITH_TIME_RANGE, which contains per-satellite data files in both CSV and TLE formats. These cover the time range from January 1, 2024, to June 6, 2025, for individual satellites. The number of files varies as satellites may have been launched at different times within this period.

    This dataset contains processed CSV versions of Starlink satellite data originally available from CelesTrak, a publicly available source for satellite orbital information.

    CelesTrak publishes satellite position data in TLE (Two-Line Element) format, which describes a satellite’s orbit using two compact lines of text. While TLE is the standard format used by satellite agencies, it is difficult to interpret directly for beginners. So this dataset provides a cleaned and structured CSV version that is easier to use with Python and data science libraries.

    What's Inside

    Each file in the dataset corresponds to a specific Starlink satellite and contains its orbital data over a range of dates (usually 1 month). Each row is a snapshot of the satellite's position and movement at a given timestamp.

    Key columns include:

    • Satellite_Name: Unique identifier for each Starlink satellite. Example: STARLINK-1008.
    • Epoch: The timestamp (in UTC) representing the exact moment when the satellite's orbital data was recorded.
    • Inclination_deg: Angle between the satellite's orbital plane and Earth's equator. 0° means equatorial orbit; 90° means polar orbit.
    • Eccentricity: Describes the shape of the orbit. 0 = perfect circle; values approaching 1 = highly elliptical.
    • Mean_Motion_orbits_per_day: Number of orbits the satellite completes around Earth in a single day.
    • Altitude_km: Satellite's altitude above Earth's surface in kilometers, calculated from orbital parameters.
    • Latitude: Satellite's geographic latitude at the recorded time. Positive = Northern Hemisphere, Negative = Southern Hemisphere.
    • Longitude: Satellite's geographic longitude at the recorded time. Positive = East of Prime Meridian, Negative = West.

    Why CSV?

    TLE is a compact format used in aerospace and satellite communications, but:

    • It is not beginner-friendly.
    • It requires a dedicated parser.
    • It’s difficult to visualize or analyze directly.

    That’s why this dataset presents the same orbital data but in a clean and normalized CSV structure ready for analysis and machine learning.
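
    As an illustration (a sketch, not dataset documentation), one per-satellite CSV can be loaded with pandas using the documented columns; the exact file name inside the per-satellite folder is an assumption:

    import pandas as pd

    df = pd.read_csv("STARLINK-1008.csv", parse_dates=["Epoch"])
    df = df.sort_values("Epoch")
    print(df[["Altitude_km", "Inclination_deg", "Mean_Motion_orbits_per_day"]].describe())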

    Use Cases

    • Satellite orbit visualization.
    • Time-series analysis of Starlink constellations.
    • Anomaly detection (e.g., using autoencoders or clustering).
    • Feature engineering for orbit-based models.
    • Educational projects for learning satellite mechanics.

    Data Source

    • Source: CelesTrak Starlink TLE Feed
    • Converted to CSV using custom Python scripts.
    • Time range: Typically one month per file (can vary).
  15. Simulator data

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Simulator data [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7057250?locale=bg
    Explore at:
    unknown (266936983 bytes)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The archive contains simulator data in CSV format and Python routines enabling their post-processing and plotting. A "README" file explains how to use these routines. These data were recorded during the final EFAICTS project evaluations and used in a publication that is also available on Zenodo: 10.5281/zenodo.6796534

  16. Python Code Snippets for Bug Detection

    • kaggle.com
    zip
    Updated Oct 28, 2025
    Cite
    Jagriti Srivastava (2025). Python Code Snippets for Bug Detection [Dataset]. https://www.kaggle.com/datasets/jagritisrivastava/python-code-snippets-for-bug-detection
    Explore at:
    zip (18739 bytes)
    Dataset updated
    Oct 28, 2025
    Authors
    Jagriti Srivastava
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview:

    This dataset contains Python function and class snippets extracted from multiple public repositories. Each snippet is labeled as clean (0) or buggy (1). It is intended for training machine learning models for automated bug detection, code quality analysis, and code classification tasks.

    Contents:

    • JSON file (dataset.json) containing all code snippets.

    • CSV file (dataset.csv) formatted for Kaggle, with columns:

    • code: Python snippet

    • label: 0 = clean, 1 = buggy

    Usage:

    • Train ML models for code bug detection.

    • Experiment with static analysis, code classification, or NLP models on code.

    • Benchmark code analysis tools or AI assistants.

    How It Was Created:

    • Python code from multiple public repositories was parsed to extract function and class snippets.

    • Each snippet was executed to determine if it raises an exception (buggy) or runs cleanly.

    • Additional buggy variants were generated automatically by introducing common code errors (wrong operator, division by zero, missing import, variable renaming).
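
    A minimal sketch of the exec-based labeling approach described above (illustrative only, not the actual extraction pipeline); the dataset.csv name and the code/label columns follow the Contents section:

    import pandas as pd

    def label_snippet(code: str) -> int:
        """Return 1 (buggy) if executing the snippet raises an exception, else 0 (clean)."""
        try:
            exec(compile(code, "<snippet>", "exec"), {})
            return 0
        except Exception:
            return 1

    snippets = ["print(1 + 1)", "print(1 / 0)"]
    df = pd.DataFrame({"code": snippets, "label": [label_snippet(s) for s in snippets]})
    df.to_csv("dataset.csv", index=False)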

    Dataset Size:

    • ~XXX snippets (you can replace with actual number)

    • Balanced between clean and buggy code

    License:

    CC0 1.0 Universal (Public Domain) – Free to use for research and commercial purposes.

  17. Comprehensive Formula 1 Dataset (2020-2025)

    • kaggle.com
    Updated Jul 27, 2025
    Cite
    V SHREE KAMALESH (2025). Comprehensive Formula 1 Dataset (2020-2025) [Dataset]. https://www.kaggle.com/datasets/vshreekamalesh/comprehensive-formula-1-dataset-2020-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    V SHREE KAMALESH
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Formula 1 Comprehensive Dataset (2020-2025)

    Dataset Description

    This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.

    Perfect for:

    📊 F1 performance analysis

    🤖 Machine learning projects

    📈 Data visualization

    🏆 Championship predictions

    📋 Racing statistics research

    📁 Files Included

    1. f1_race_results_2020_2025.csv (53 entries): Race winners and results from Grand Prix weekends

    Date, Grand Prix name, race winner

    Constructor, nationality, grid position

    Race time, fastest lap time, points scored

    2. f1_qualifying_results_2020_2024.csv (820 entries): Qualifying session results with timing data

    Q1, Q2, Q3 session times

    Grid positions, laps completed

    Driver and constructor information

    3. f1_driver_standings_progressive.csv (600 entries): Championship standings progression throughout seasons

    Points accumulation over race weekends

    Wins, podiums, pole positions tracking

    Season-long championship battle data

    4. f1_constructor_standings_progressive.csv (360 entries): Team championship standings evolution

    Constructor points and wins

    Team performance metrics

    Manufacturer rivalry data

    5. f1_circuits_technical_data.csv (24 entries): Technical specifications for all F1 circuits

    Track length, number of turns

    Lap records and record holders

    Circuit designers and first F1 usage

    6. f1_historical_driver_statistics.csv (30 entries): All-time career statistics for F1 drivers

    Career wins, poles, podiums

    Racing entries and achievements

    Active and retired driver records

    7. f1_comprehensive_dataset_2020_2025.csv (432 entries): MAIN DATASET - Combined data from all sources

    Multiple data types in one file

    Ready for immediate analysis

    Comprehensive F1 information hub

    🔧 Data Features

    Clean & Structured: All data professionally formatted
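
    A quick-start sketch (illustrative only) for the main combined file listed above; its columns are not documented in this description, so inspect them before analysis:

    import pandas as pd

    df = pd.read_csv("f1_comprehensive_dataset_2020_2025.csv")
    print(df.shape)
    print(df.columns.tolist())
    print(df.head())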

  18. mtcars-parquet

    • kaggle.com
    zip
    Updated Aug 17, 2025
    Cite
    MUHAMMAD ABDAL (2025). mtcars-parquet [Dataset]. https://www.kaggle.com/datasets/muhammadabdal123/mtcars-parquet
    Explore at:
    zip (1040 bytes)
    Dataset updated
    Aug 17, 2025
    Authors
    MUHAMMAD ABDAL
    Description

    Dataset Title: Motor Trend Car Road Tests (mtcars) Description: The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a classic, foundational dataset used extensively in statistics and data science for learning exploratory data analysis, regression modeling, and hypothesis testing.

    This dataset is a staple in the R programming language (?mtcars) and is now provided here in a clean CSV format for easy access in Python, Excel, and other data analysis environments.

    Acknowledgements: This dataset was originally compiled and made available by the journal Motor Trend in 1974. It has been bundled with the R statistical programming language for decades, serving as an invaluable resource for learners and practitioners alike.

    Data Dictionary: Each row represents a different car model. The columns (variables) are as follows:

    • model (object/string): The name and model of the car.
    • mpg (float): Miles/(US) gallon. A measure of fuel efficiency.
    • cyl (integer): Number of cylinders (4, 6, 8).
    • disp (float): Displacement (cubic inches). Engine size.
    • hp (integer): Gross horsepower. Engine power.
    • drat (float): Rear axle ratio. Affects torque and fuel economy.
    • wt (float): Weight (1000 lbs). Vehicle mass.
    • qsec (float): 1/4 mile time (seconds). A measure of acceleration.
    • vs (binary): Engine shape (0 = V-shaped, 1 = Straight).
    • am (binary): Transmission (0 = Automatic, 1 = Manual).
    • gear (integer): Number of forward gears (3, 4, 5).
    • carb (integer): Number of carburetors (1, 2, 3, 4, 6, 8).

    Key Questions & Potential Use Cases: This dataset is perfect for exploring relationships between a car's specifications and its performance. Some classic analysis questions include:

    Fuel Efficiency: What factors are most predictive of a car's miles per gallon (mpg)? Is it engine size (disp), weight (wt), or horsepower (hp)?

    Performance: How does transmission type (am) affect acceleration (qsec) and fuel economy (mpg)? Do manual cars perform better?

    Classification: Can we accurately predict the number of cylinders (cyl) or the type of engine (vs) based on other car features?

    Clustering: Are there natural groupings of cars (e.g., performance cars, economy cars) based on their specifications?

    Inspiration: This is one of the most famous datasets in statistics. You can find thousands of examples, tutorials, and analyses using it online. It's an excellent starting point for:

    Practicing multiple linear regression and correlation analysis.

    Building your first EDA (Exploratory Data Analysis) notebook.

    Learning about feature engineering and model interpretation.

    Comparing statistical results from R and Python (e.g., statsmodels vs scikit-learn).

    File Details: mtcars-parquet.csv: The main dataset file in CSV format.

    Number of instances (rows): 32

    Number of attributes (columns): 12

    Missing Values? No, this is a complete dataset.
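
    A quick illustrative look at the classic mpg relationships using the documented columns and file (a sketch, not part of the dataset):

    import pandas as pd

    df = pd.read_csv("mtcars-parquet.csv")
    print(df[["mpg", "wt", "hp", "disp"]].corr()["mpg"])   # which factors track fuel efficiency?
    print(df.groupby("am")["mpg"].mean())                  # automatic (0) vs manual (1) transmission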

  19. Bybit ETH/USDT Historical Data (2021-2025)

    • kaggle.com
    zip
    Updated Jun 28, 2025
    Cite
    AnubhavBhadani142 (2025). Bybit ETH/USDT Historical Data (2021-2025) [Dataset]. https://www.kaggle.com/datasets/anubhavbhadani142/bybit-ethusdt-historical-data-2021-2025
    Explore at:
    zip (3666866 bytes)
    Dataset updated
    Jun 28, 2025
    Authors
    AnubhavBhadani142
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Cryptocurrency trading analysis and algorithmic strategy development rely on high-quality, high-frequency historical data. This dataset provides clean, structured OHLCV data for one of the most liquid and popular trading pairs, ETH/USDT, sourced directly from the Bybit exchange. It is ideal for quantitative analysts, data scientists, and trading enthusiasts looking to backtest strategies, perform market analysis, or build predictive models across different time horizons.

    Content

    The dataset consists of three separate CSV files, each corresponding to a different time frame:

    • BYBIT_ETHUSDT_15m.csv: Historical data in 15-minute intervals.
    • BYBIT_ETHUSDT_1h.csv: Historical data in 1-hour intervals.
    • BYBIT_ETHUSDT_4h.csv: Historical data in 4-hour intervals.

    Each file contains the same six columns:

    • Datetime: The UTC timestamp for the start of the candle/bar.
    • Open: The opening price of ETH at the start of the interval.
    • High: The highest price reached during the interval.
    • Low: The lowest price reached during the interval.
    • Close: The closing price at the end of the interval.
    • Volume: The trading volume in the base asset (ETH) during the interval.

    Methodology & Update Schedule

    • Source: The data was collected using the public API of the Bybit cryptocurrency exchange via a Python script utilizing the ccxt library.
    • Data Range: The dataset currently covers the period from July 5, 2021, to June 28, 2025.
    • Update Frequency: This dataset is maintained locally and will be updated on a weekly basis to include the most recent trading data, ensuring its relevance for ongoing analysis.

    Acknowledgements

    This dataset is made possible by the publicly available data from the Bybit exchange. Please consider this when using the data for your projects.

    Inspiration (Potential Use Cases)

    • Backtesting Trading Strategies: Test the performance of strategies like moving average crossovers, RSI-based signals, or MACD indicators.
    • Time Series Forecasting: Build models (e.g., ARIMA, LSTM, Prophet) to predict future price movements.
    • Volatility Analysis: Analyze market volatility by calculating rolling standard deviations or other risk metrics.
    • Feature Engineering: Create new technical indicators and features for machine learning models.
    • Market Visualization: Plot candlestick charts and overlay them with various technical analysis tools.
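
    As a starting point for the backtesting use case above, a simple moving-average crossover signal can be computed from the documented columns (a sketch, not a trading recommendation):

    import pandas as pd

    df = pd.read_csv("BYBIT_ETHUSDT_1h.csv", parse_dates=["Datetime"])
    df = df.sort_values("Datetime").set_index("Datetime")
    df["ma_fast"] = df["Close"].rolling(20).mean()
    df["ma_slow"] = df["Close"].rolling(100).mean()
    df["signal"] = (df["ma_fast"] > df["ma_slow"]).astype(int)   # 1 = long, 0 = flat
    print(df[["Close", "ma_fast", "ma_slow", "signal"]].tail())
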
  20. Danish Residential Housing Prices 1992-2024

    • kaggle.com
    Updated Nov 29, 2024
    Cite
    Martin Frederiksen (2024). Danish Residential Housing Prices 1992-2024 [Dataset]. https://www.kaggle.com/datasets/martinfrederiksen/danish-residential-housing-prices-1992-2024
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Martin Frederiksen
    Description

    Danish residential house prices (1992-2024)

    About the dataset (cleaned data)

    The dataset (parquet file) contains approximately 1.5 million residential household sales from Denmark during the period from 1992 to 2024. All cleaned data is merged into one parquet file here on Kaggle. Note that some cleaning might still be necessary; see the notebook under Code.

    A random sample (100k rows) of the dataset is also included as a csv file.

    Done in Python version: 2.6.3.

    Raw data

    Raw data and more info are available in the GitHub repository: https://github.com/MartinSamFred/Danish-residential-housingPrices-1992-2024.git

    The dataset has been scraped and cleaned (to some extent). Cleaned files are located in \Housing_data_cleaned\ and named DKHousingprices_1 and DKHousingprices_2. Saved in parquet format (split into two files due to size).

    The cleaning steps from the raw files to the cleaned files above are outlined in BoligsalgConcatCleanigGit.ipynb (done in Python version: 2.6.3).

    Webscraping script: Webscrape_script.ipynb (done in Python version: 2.6.3)

    If you want to clean the raw files from scratch yourself:

    Uncleaned scraped files (81 in total) are located in \Housing_data_raw\ under Housing_data_batch1 and Housing_data_batch2. Saved in .csv format and compressed as 7-zip files.

    Additional files added/appended to the Cleaned files are located in \Addtional_data and named DK_inflation_rates, DK_interest_rates, DK_morgage_rates and DK_regions_zip_codes. Saved in .xlsx format.

    Content

    Each row in the dataset contains a residential household sale during the period 1992 - 2024.

    “Cleaned files” columns (a short loading sketch follows this list):

    0 'date': is the transaction date

    1 'quarter': is the quarter based on a standard calendar year

    2 'house_id': unique house id (could be dropped)

    3 'house_type': can be 'Villa', 'Farm', 'Summerhouse', 'Apartment', 'Townhouse'

    4 'sales_type': can be 'regular_sale', 'family_sale', 'other_sale', 'auction', '-' (“-“ could be dropped)

    5 'year_build': range 1000 to 2024 (could be narrowed more)

    6 'purchase_price': is purchase price in DKK

    7 '%_change_between_offer_and_purchase': the percentage difference between offer and purchase price; can be negative, zero or positive

    8 'no_rooms': number of rooms

    9 'sqm': number of square meters

    10 'sqm_price': 'purchase_price' divided by 'sqm'

    11 'address': is the address

    12 'zip_code': is the zip code

    13 'city': is the city

    14 'area': 'East & mid jutland', 'North jutland', 'Other islands', 'Capital, Copenhagen', 'South jutland', 'North Zealand', 'Fyn & islands', 'Bornholm'

    15 'region': 'Jutland', 'Zealand', 'Fyn & islands', 'Bornholm'

    16 'nom_interest_rate%': Danish nominal interest rate, shown per quarter; note the rate is annualized and has not been converted to a quarterly rate

    17 'dk_ann_infl_rate%': Danish annual inflation rate, shown per quarter; note the rate is annualized and has not been converted to a quarterly rate

    18 'yield_on_mortgage_credit_bonds%': 30 year mortgage bond rate (without spread)
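
    As a quick sanity check on this schema, here is a hedged pandas sketch that loads the merged parquet file, verifies that 'sqm_price' equals 'purchase_price' divided by 'sqm', and aggregates prices by region and year. The parquet file name below is an assumption; adjust it to the file you download from Kaggle.

```python
# Minimal sketch (file name is a placeholder; column names follow the list above).
import pandas as pd

df = pd.read_parquet("DKHousingprices.parquet")   # requires pyarrow or fastparquet

# sqm_price should be purchase_price / sqm (columns 6, 9 and 10 above).
recomputed = df["purchase_price"] / df["sqm"]
mismatches = (recomputed - df["sqm_price"]).abs() > 1.0
print(f"rows where sqm_price deviates from purchase_price / sqm: {mismatches.sum()}")

# Median price per square meter by region and calendar year.
df["year"] = pd.to_datetime(df["date"]).dt.year
print(df.groupby(["region", "year"])["sqm_price"].median().tail(10))
```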

    Uses

    Various statistical analyses, visualisation and, presumably, machine learning as well.

    Practice exercises etc.

    Uncleaned scraped files are great for practicing cleaning, especially string cleaning. I’m not an expert, as can be seen in the coding ;-).

    Disclaimer

    The data and information in the dataset provided here are intended primarily for educational purposes. I do not own any of the data, and all rights are reserved to the respective owners as outlined in “Acknowledgements/sources”. The accuracy of the dataset is not guaranteed; accordingly, any analysis and/or conclusions are solely the user's own responsibility.

    Acknowledgements/sources

    All data is publicly available on:

    Boliga: https://www.boliga.dk/

    Finans Danmark: https://finansdanmark.dk/

    Danmarks Statistik: https://www.dst.dk/da

    Statistikbanken: https://statistikbanken.dk/statbank5a/default.asp?w=2560

    Macrotrends: https://www.macrotrends.net/

    PostNord: https://www.postnord.dk/

    World Data: https://www.worlddata.info/

    Dataset picture / cover photo: Nick Karvounis (https://unsplash.com/)

    Have fun… :-)
