Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The Natural Questions (NQ) dataset is a comprehensive collection of real user queries submitted to Google Search, with answers sourced from Wikipedia by expert annotators. Created by Google AI Research, this dataset aims to support the development and evaluation of advanced automated question-answering systems. The version provided here includes 89,312 meticulously annotated entries, tailored for ease of access and utility in natural language processing (NLP) and machine learning (ML) research.
The dataset is composed of authentic search queries from Google Search, reflecting the wide range of information sought by users globally. This approach ensures a realistic and diverse set of questions for NLP applications.
The NQ dataset underwent significant pre-processing to prepare it for NLP tasks: - Removal of web-specific elements like URLs, hashtags, user mentions, and special characters using Python's "BeautifulSoup" and "regex" libraries. - Grammatical error identification and correction using the "LanguageTool" library, an open-source grammar, style, and spell checker.
These steps were taken to clean and simplify the text while retaining the essence of the questions and their answers, divided into 'questions', 'long answers', and 'short answers'.
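A minimal sketch of this kind of cleaning (not the exact pipeline used for this dataset) for a single text string, assuming BeautifulSoup and regex as described above; grammar correction could then be applied with the LanguageTool library:

from bs4 import BeautifulSoup
import re

def clean_text(text):
    # Strip any embedded HTML markup
    text = BeautifulSoup(text, "html.parser").get_text(" ")
    # Remove URLs, hashtags, and user mentions
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    # Drop remaining special characters and collapse whitespace
    text = re.sub(r"[^A-Za-z0-9.,;:?!'\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()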
The unprocessed data, including answers with embedded HTML, empty or complex long and short answers, is stored in "Natural-Questions-Base.csv". This version retains the raw structure of the data, featuring HTML elements in answers, and varied answer formats such as tables and lists, providing a comprehensive view for those interested in the original dataset's complexity and richness. The processed data is compiled into a single CSV file named "Natural-Questions-Filtered.csv". The file is structured for easy access and analysis, with each record containing the processed question, a detailed answer, and concise answer snippets.
The filtered version is available where specific criteria, such as question length or answer complexity, were applied to refine the data further. This version allows for more focused research and application development.
The repository at 'https://github.com/fujoos/natural_questions' also includes a Flask-based CSV reader application designed to read and display the contents of the "NaturalQuestions.csv" file. The app provides functionalities such as: - Viewing questions and answers directly in your browser. - Filtering results based on criteria like question keywords or answer length. - See the live demo, which uses the CSV files converted to a SQLite database, at 'https://fujoos.pythonanywhere.com/'
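A minimal sketch of a Flask-based CSV viewer of this kind (illustrative only, not the repository's actual code):

from flask import Flask
import pandas as pd

app = Flask(__name__)

@app.route("/")
def show_questions():
    # The repository's app reads "NaturalQuestions.csv"; path may differ locally
    df = pd.read_csv("NaturalQuestions.csv")
    # Render the first rows as a simple HTML table in the browser
    return df.head(50).to_html(index=False)

if __name__ == "__main__":
    app.run(debug=True)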
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
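For example, the stress-strain history can be inspected directly from these two columns (a minimal sketch; matplotlib and an illustrative filename are assumed):

import pandas
import matplotlib.pyplot as plt

# Illustrative filename; substitute the path to one of the downsampled data files
data = pandas.read_csv("LP_example.csv", index_col=0)
plt.plot(data["e_true"], data["Sigma_true"])
plt.xlabel("True strain (e_true)")
plt.ylabel("True stress (Sigma_true)")
plt.show()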
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version are placeholders identifying the particular summary file
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                   keep_default_na=False, na_values='')
Caveats
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
This dataset captures FC Barcelona's journey in the 2024–2025 UEFA Champions League, with detailed statistics scraped from FBref.com. It includes comprehensive match and player-level data covering all major performance areas such as passing, shooting, defending, goalkeeping, and more.
The data was collected using Python and Playwright and organized into clean CSV files for easy analysis. The scraping code is available on GitHub.
The dataset was collected by scraping FBref’s publicly available tables using Python and Playwright. The following tables were extracted:
| Filename | Description |
|---|---|
| Standard_Stats_2024-2025_Barcelona_Champions_League.csv | Basic stats per player (games, goals, assists, etc.) |
| Scores_and_Fixtures_2024-2025_Barcelona_Champions_League.csv | Match results, dates, formation, etc. |
| Goalkeeping_2024-2025_Barcelona_Champions_League.csv | Goals against, saves, wins, losses, etc. |
| Advanced_Goalkeeping_2024-2025_Barcelona_Champions_League.csv | Goals against, post-shot expected goals, throws attempted, etc. |
| Shooting_2024-2025_Barcelona_Champions_League.csv | Shot types, goals, penalty kicks, etc. |
| Passing_2024-2025_Barcelona_Champions_League.csv | Total passes, pass distance, key passes, assists, expected assists, etc. |
| Pass_Types_2024-2025_Barcelona_Champions_League.csv | Pass types, crosses, switches, etc. |
| Goal_and_Shot_Creation_2024-2025_Barcelona_Champions_League.csv | Shot-creating actions, goal-creating actions, etc. |
| Defensive_Actions_2024-2025_Barcelona_Champions_League.csv | Tackles, dribbles, etc. |
| Possession_2024-2025_Barcelona_Champions_League.csv | Ball touches, carries, take-ons, etc. |
| Playing_Time_2024-2025_Barcelona_Champions_League.csv | Minutes, starts, substitutions, etc. |
| Miscellaneous_Stats_2024-2025_Barcelona_Champions_League.csv | Fouls, cards, offsides, aerials won/lost, etc. |
| League_phase,_Champions_League.csv | Champions League group/phase info |
📌 Note: This dataset is not fully cleaned. It contains missing values (NaN). In addition, multiple tables need to be merged to get a complete picture of each player's performance. This makes the dataset a great opportunity for beginners to practice data cleaning, handling missing data, and combining related datasets for analysis.
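A minimal sketch of the merging and missing-value handling described in the note above (the join column "Player" is an assumption; check the actual headers in each file):

import pandas as pd

std = pd.read_csv("Standard_Stats_2024-2025_Barcelona_Champions_League.csv")
sho = pd.read_csv("Shooting_2024-2025_Barcelona_Champions_League.csv")
# Assumed shared player-name column; adjust to the real column names
merged = std.merge(sho, on="Player", how="left", suffixes=("", "_shooting"))
# Basic missing-value handling before analysis
merged = merged.dropna(subset=["Player"]).fillna(0)
print(merged.head())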
Terms: https://crawlfeeds.com/privacy_policy
Looking for reliable and actionable data from the Vint Marketplace? Our expertly extracted dataset is just what you need. With over 20,000 records in CSV format, this dataset is tailored to meet the needs of analysts, researchers, and businesses looking to gain valuable insights into the thriving marketplace for fine wines and spirits.
We understand the value of quality data in driving decisions. This 20k-record CSV dataset is meticulously compiled to provide structured and accessible information for your specific requirements. Whether you're conducting market research or building an e-commerce platform, this dataset offers the granular detail you need.
Unlock the potential of fine wine data with our Vint Marketplace CSV dataset. With its organized format and extensive records, it’s the perfect resource to elevate your projects. Contact us now to access the dataset and take the next step in data-driven decision-making.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.
The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.
The full set of files, in order of use, is as follows:
- Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
- 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
- URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
- 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
- scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
- HTML.zip: Archived version of the set of individual HTML files.
- 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
- TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
- input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
- 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
- categorization_applied.csv: Output file containing the classification of selected pages.
- exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
- 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
- metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
- TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
- TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
- 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
- TM_125.RData: RData file containing the results of the 125-topic model.
- loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
- 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
- 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
- Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
- 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
- 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
- URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.
For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.
The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
This dataset is a clean CSV file with the most recent estimates of the populations of countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/
The data was generated by web scraping the aforementioned link on 16 August 2021. Below is the code used to produce the CSV in Python 3.8:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Download the population-by-country page
url = "https://www.worldometers.info/world-population/population-by-country/"
r = requests.get(url)

# Parse the HTML and grab the first (and only) data table
soup = BeautifulSoup(r.content)
countries = soup.find_all("table")[0]

# Let pandas convert the HTML table into a DataFrame and save it as CSV
dataframe = pd.read_html(str(countries))[0]
dataframe.to_csv("countries_by_population_2021.csv", index=False)
The creation of this dataset would not be possible without the team at Worldometer, a data aggregation website.
This dataset is a refined version of Alpine 1.0. It was created by generating tasks using various LLMs, wrapping them in special markers {Instruction Start} ... {Instruction End}, and saving them in a text file. We then processed this file with a Python script that used regex to extract the tasks into a CSV. Afterward, we cleaned the dataset by removing near-duplicates, vague prompts, and ambiguous entries:
python clean.py -i prompts.csv -o cleaned.csv -p "prompt" -t 0.92 -l 30
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/marcuscedricridia/alpine1.1-multireq-instructions-seed.
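A minimal sketch of the regex extraction step (illustrative only, not the actual script; the filenames are hypothetical):

import csv
import re

with open("tasks.txt", encoding="utf-8") as f:
    raw = f.read()

# Pull out everything between the {Instruction Start} ... {Instruction End} markers
tasks = re.findall(r"\{Instruction Start\}(.*?)\{Instruction End\}", raw, flags=re.DOTALL)

with open("prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])
    for task in tasks:
        writer.writerow([task.strip()])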
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset comprises spatial and temporal data related to our analysis on blue and green water consumption (WC) of global crop production in high spatial resolution (5 arc-minutes – approximately 10 km at the equator) for the years 2020, 2010 and 2000.
Modelling water consumption of SPAM data
We use SPAM (Spatial Production Allocation Model) data, released by the International Food Policy Research Institute (IFPRI). We use SPAM2020 data for the year 2020 (46 crops), SPAM2010 data for the year 2010 (42 crops) and SPAM2000 data for the year 2000 (20 crops).
We develop a Python-based global gridded crop green and blue WC assessment tool, entitled CropGBWater. Operating on a daily time scale, CropGBWater dynamically simulates the rootzone water balance and related fluxes. We provide this model open access as Data_S10.
SPAM2020 crop data are modelled for the years 2018-2022, SPAM2010 crop data for the years 2008-2012 and SPAM2000 crop data for the years 1998-2002. We compute WCbl (blue WC) and WCgn (green WC), with components WCgn,irr (green WC of irrigated area) and WCgn,rf (green WC of rainfed area).
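The model itself is provided as Data_S10. Purely as a toy illustration of how green and blue WC can be separated in a daily bucket-type rootzone balance (not the CropGBWater implementation), one might write:

def daily_water_balance(rain, et_demand, soil_capacity, irrigated=True):
    # Toy bucket model: track soil moisture day by day and split ET into
    # green (rain/soil-fed) and blue (irrigation-fed) water consumption, all in mm.
    soil = soil_capacity  # start with a full rootzone store
    wc_green, wc_blue = 0.0, 0.0
    for p, et in zip(rain, et_demand):
        soil = min(soil + p, soil_capacity)   # infiltration, excess drains off
        green_supply = min(et, soil)          # ET met from rain/soil moisture
        soil -= green_supply
        deficit = et - green_supply
        if irrigated:
            wc_blue += deficit                # irrigation fills the remaining demand
        wc_green += green_supply
    return wc_green, wc_blue

print(daily_water_balance([2.0] * 10, [5.0] * 10, soil_capacity=10.0))  # (green, blue) totals in mm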
File description:
The dataset consists of the following files:
Please only use the latest version of this Zenodo repository.
Publication:
For all details, please refer to the open access paper:
Chukalla, A.D., Mekonnen, M.M., Gunathilake, D., Wolkeba, F.T., Gunasekara, B., Vanham, D. (2025) Global spatially explicit crop water consumption shows an overall increase of 9% for 46 agricultural crops from 2010 to 2020, Nature Food, Volume 6, https://doi.org/10.1038/s43016-025-01231-x
Funding:
This research, led by IWMI, a CGIAR centre, was carried out under the CGIAR Initiative on Foresight (www.cgiar.org/initiative/foresight/) as well as the CGIAR “Policy innovations” Science Program (www.cgiar.org/cgiar-research-porfolio-2025-2030/policy-innovations). The authors would like to thank all funders who supported this research through their contributions to the CGIAR Trust Fund (www.cgiar.org/funders).
This dataset was created by Martin Kanju
Released under Other (specified in description)
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document. Data were synthesized into comma-separated values, which include produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results from modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize FracFocus data are provided in this data release.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. The Jupyter Notebook containing the code used to clean the data can be found here.
Version 6 includes some additional cleaning and structuring of issues noticed after importing into Power BI. Changes were made by adding code to the Python notebook to export a newly cleaned dataset, such as adding a MonthNumber column for sorting by month, and similarly a WeekDayNumber column.
Cleaning was done in Python, with SQL Server also used to inspect the data quickly. Headers were added separately, ensuring no data loss. The data was cleaned of NaN and garbage values across the columns.
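A minimal sketch of adding such sort-helper columns with pandas (the filename and date column name are hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["OrderDate"])  # hypothetical file and date column
# Numeric helper columns so Power BI can sort month and weekday names correctly
df["MonthNumber"] = df["OrderDate"].dt.month
df["WeekDayNumber"] = df["OrderDate"].dt.dayofweek  # Monday = 0
df.to_csv("sales_cleaned.csv", index=False)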
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.
Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.
- train.csv: training data, with the Target column (0 = Clean, 1 = Fraud).
- test_share.csv: same structure as train.csv but without the Target column.
- Geo_scores.csv: location-based scores.
- Lambda_wts.csv: proprietary weights, matched by Group.
- Qset_tats.csv: network turn-around times (TAT).
- instance_scores.csv: vulnerability scores.
Suggested workflow: merge the extra files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching on id or Group, train and validate on train.csv (Target is ~1% fraud), then predict on test_share.csv or your own external data. Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
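A minimal starting point along these lines (the column names id, Group, and Target come from the description above; everything else is illustrative, not a reference solution):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")
geo = pd.read_csv("Geo_scores.csv")
# Merge one of the extra feature files on the shared id column (assumed key)
train = train.merge(geo, on="id", how="left")

X = train.drop(columns=["Target"]).select_dtypes("number").fillna(0)
y = train["Target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# class_weight="balanced" is one simple way to handle the ~1% fraud rate
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_val, clf.predict(X_val)))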
Tags: fraud-detection, classification, imbalanced-data, financial-transactions, machine-learning, python, beginner-friendly
License: CC BY-NC-SA 4.0
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains Starlink satellite data in both CSV and TLE formats. At the top level, it includes four files: one set representing a snapshot of all Starlink satellites at a specific time and another set representing a time-range dataset for STARLINK-1008 from March 11 to April 10, 2025. Additionally, there is a folder named STARLINK_INDIVIDUAL_SATELLITE_CSV_TLE_FILES_WITH_TIME_RANGE, which contains per-satellite data files in both CSV and TLE formats. These cover the time range from January 1, 2024, to June 6, 2025, for individual satellites. The number of files varies as satellites may have been launched at different times within this period.
This dataset contains processed CSV versions of Starlink satellite data originally available from CelesTrak, a publicly available source for satellite orbital information.
CelesTrak publishes satellite position data in TLE (Two-Line Element) format, which describes a satellite’s orbit using two compact lines of text. While TLE is the standard format used by satellite agencies, it is difficult to interpret directly for beginners. So this dataset provides a cleaned and structured CSV version that is easier to use with Python and data science libraries.
Each file in the dataset corresponds to a specific Starlink satellite and contains its orbital data over a range of dates (usually 1 month). Each row is a snapshot of the satellite's position and movement at a given timestamp.
Key columns include:
| Column Name | Description |
|---|---|
| Satellite_Name | Unique identifier for each Starlink satellite. Example: STARLINK-1008. |
| Epoch | The timestamp (in UTC) representing the exact moment when the satellite's orbital data was recorded. |
| Inclination_deg | Angle between the satellite’s orbital plane and Earth’s equator. 0° means equatorial orbit; 90° means polar orbit. |
| Eccentricity | Describes the shape of the orbit. 0 = perfect circle; values approaching 1 = highly elliptical. |
| Mean_Motion_orbits_per_day | Number of orbits the satellite completes around Earth in a single day. |
| Altitude_km | Satellite’s altitude above Earth’s surface in kilometers, calculated from orbital parameters. |
| Latitude | Satellite’s geographic latitude at the recorded time. Positive = Northern Hemisphere, Negative = Southern Hemisphere. |
| Longitude | Satellite’s geographic longitude at the recorded time. Positive = East of Prime Meridian, Negative = West. |
TLE is a compact format used in aerospace and satellite communications, but it is difficult for beginners to interpret directly and is not immediately usable with common data analysis tools.
That’s why this dataset presents the same orbital data but in a clean and normalized CSV structure ready for analysis and machine learning.
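A minimal sketch for loading one of the per-satellite files and plotting altitude over time (the filename is illustrative; the column names come from the table above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("STARLINK-1008.csv", parse_dates=["Epoch"])  # hypothetical per-satellite file
df = df.sort_values("Epoch")
plt.plot(df["Epoch"], df["Altitude_km"])
plt.xlabel("Epoch (UTC)")
plt.ylabel("Altitude (km)")
plt.title(df["Satellite_Name"].iloc[0])
plt.show()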
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The archive contains simulator data in CSV format and Python routines enabling their post-processing and plotting. A "README" file explains how to use these routines. These data were recorded during the final EFAICTS project evaluations and used in a publication also available on Zenodo: 10.5281/zenodo.6796534
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains Python function and class snippets extracted from multiple public repositories. Each snippet is labeled as clean (0) or buggy (1). It is intended for training machine learning models for automated bug detection, code quality analysis, and code classification tasks.
JSON file (dataset.json) containing all code snippets.
CSV file (dataset.csv) formatted for Kaggle, with columns:
code: Python snippet
label: 0 = clean, 1 = buggy
Train ML models for code bug detection.
Experiment with static analysis, code classification, or NLP models on code.
Benchmark code analysis tools or AI assistants.
Python code from multiple public repositories was parsed to extract function and class snippets.
Each snippet was executed to determine if it raises an exception (buggy) or runs cleanly.
Additional buggy variants were generated automatically by introducing common code errors (wrong operator, division by zero, missing import, variable renaming).
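A minimal sketch of that labeling step (illustrative only, not the actual pipeline; executing arbitrary snippets should only ever be done in a sandboxed environment):

def label_snippet(code: str) -> int:
    # Returns 0 for clean, 1 for buggy, following the dataset's convention
    try:
        exec(compile(code, "<snippet>", "exec"), {})
        return 0
    except Exception:
        return 1

print(label_snippet("x = 1 / 0"))   # 1 (buggy: division by zero)
print(label_snippet("x = 1 + 1"))   # 0 (clean)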
~XXX snippets (you can replace with actual number)
Balanced between clean and buggy code
CC0 1.0 Universal (Public Domain) – Free to use for research and commercial purposes.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Formula 1 Comprehensive Dataset (2020-2025)
Dataset Description
This comprehensive Formula 1 dataset contains detailed racing data spanning from 2020 to 2025, including race results, qualifying sessions, championship standings, circuit information, and historical driver statistics.
Perfect for:
📊 F1 performance analysis
🤖 Machine learning projects
📈 Data visualization
🏆 Championship predictions
📋 Racing statistics research
📁 Files Included
1. f1_race_results_2020_2025.csv (53 entries): Race winners and results from Grand Prix weekends
Date, Grand Prix name, race winner
Constructor, nationality, grid position
Race time, fastest lap time, points scored
Q1, Q2, Q3 session times
Grid positions, laps completed
Driver and constructor information
Points accumulation over race weekends
Wins, podiums, pole positions tracking
Season-long championship battle data
Constructor points and wins
Team performance metrics
Manufacturer rivalry data
Track length, number of turns
Lap records and record holders
Circuit designers and first F1 usage
Career wins, poles, podiums
Racing entries and achievements
Active and retired driver records
Multiple data types in one file
Ready for immediate analysis
Comprehensive F1 information hub
🔧 Data Features: Clean & Structured. All data professionally formatted.
Dataset Title: Motor Trend Car Road Tests (mtcars)
Description: The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a classic, foundational dataset used extensively in statistics and data science for learning exploratory data analysis, regression modeling, and hypothesis testing.
This dataset is a staple in the R programming language (?mtcars) and is now provided here in a clean CSV format for easy access in Python, Excel, and other data analysis environments.
Acknowledgements: This dataset was originally compiled and made available by the journal Motor Trend in 1974. It has been bundled with the R statistical programming language for decades, serving as an invaluable resource for learners and practitioners alike.
Data Dictionary: Each row represents a different car model. The columns (variables) are as follows:
| Column Name | Data Type | Description |
|---|---|---|
| model | object (String) | The name and model of the car. |
| mpg | float | Miles/(US) gallon. A measure of fuel efficiency. |
| cyl | integer | Number of cylinders (4, 6, 8). |
| disp | float | Displacement (cubic inches). Engine size. |
| hp | integer | Gross horsepower. Engine power. |
| drat | float | Rear axle ratio. Affects torque and fuel economy. |
| wt | float | Weight (1000 lbs). Vehicle mass. |
| qsec | float | 1/4 mile time (seconds). A measure of acceleration. |
| vs | binary | Engine shape (0 = V-shaped, 1 = Straight). |
| am | binary | Transmission (0 = Automatic, 1 = Manual). |
| gear | integer | Number of forward gears (3, 4, 5). |
| carb | integer | Number of carburetors (1, 2, 3, 4, 6, 8). |
Key Questions & Potential Use Cases: This dataset is perfect for exploring relationships between a car's specifications and its performance. Some classic analysis questions include:
Fuel Efficiency: What factors are most predictive of a car's miles per gallon (mpg)? Is it engine size (disp), weight (wt), or horsepower (hp)?
Performance: How does transmission type (am) affect acceleration (qsec) and fuel economy (mpg)? Do manual cars perform better?
Classification: Can we accurately predict the number of cylinders (cyl) or the type of engine (vs) based on other car features?
Clustering: Are there natural groupings of cars (e.g., performance cars, economy cars) based on their specifications?
Inspiration: This is one of the most famous datasets in statistics. You can find thousands of examples, tutorials, and analyses using it online. It's an excellent starting point for:
Practicing multiple linear regression and correlation analysis.
Building your first EDA (Exploratory Data Analysis) notebook.
Learning about feature engineering and model interpretation.
Comparing statistical results from R and Python (e.g., statsmodels vs scikit-learn).
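A minimal regression sketch in Python along these lines (column names from the data dictionary above; the filename matches the file listed under File Details below):

import pandas as pd
import statsmodels.api as sm

cars = pd.read_csv("mtcars-parquet.csv")
# Predict fuel efficiency (mpg) from weight and horsepower
X = sm.add_constant(cars[["wt", "hp"]])
model = sm.OLS(cars["mpg"], X).fit()
print(model.summary())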
File Details: mtcars-parquet.csv: The main dataset file in CSV format.
Number of instances (rows): 32
Number of attributes (columns): 12
Missing Values? No, this is a complete dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Cryptocurrency trading analysis and algorithmic strategy development rely on high-quality, high-frequency historical data. This dataset provides clean, structured OHLCV data for one of the most liquid and popular trading pairs, ETH/USDT, sourced directly from the Bybit exchange. It is ideal for quantitative analysts, data scientists, and trading enthusiasts looking to backtest strategies, perform market analysis, or build predictive models across different time horizons.
The dataset consists of three separate CSV files, each corresponding to a different time frame:
BYBIT_ETHUSDT_15m.csv: Historical data in 15-minute intervals. BYBIT_ETHUSDT_1h.csv: Historical data in 1-hour intervals. BYBIT_ETHUSDT_4h.csv: Historical data in 4-hour intervals.
Each file contains the same six columns:
This dataset is made possible by the publicly available data from the Bybit exchange. Please consider this when using the data for your projects.
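A minimal sketch for loading one of the files, computing simple returns, and aggregating hourly candles into daily bars (the column names are assumed to be standard OHLCV labels; check the actual headers):

import pandas as pd

eth = pd.read_csv("BYBIT_ETHUSDT_1h.csv")
# Assumed column names; adjust to the actual headers in the file
eth["timestamp"] = pd.to_datetime(eth["timestamp"])
eth = eth.set_index("timestamp").sort_index()
eth["return"] = eth["close"].pct_change()
# Aggregate hourly candles into daily OHLCV bars
daily = eth.resample("1D").agg({"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"})
print(daily.tail())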
About the dataset (cleaned data)
The dataset (parquet file) contains approximately 1.5 million residential household sales from Denmark during the period from 1992 to 2024. All cleaned data is merged into one parquet file here on Kaggle. Note that some cleaning might still be necessary; see the notebook under Code.
Also, added a random sample (100k) of the dataset as a csv file.
Done in Python version: 2.6.3.
Raw data
Raw data and more info are available in the GitHub repository: https://github.com/MartinSamFred/Danish-residential-housingPrices-1992-2024.git
The dataset has been scraped and cleaned (to some extent). Cleaned files are located in \Housing_data_cleaned\ and named DKHousingprices_1 and 2, saved in parquet format (as two files due to size).
Cleaning from the raw files to the above cleaned files is outlined in BoligsalgConcatCleanigGit.ipynb (done in Python version: 2.6.3).
Webscraping script: Webscrape_script.ipynb (done in Python version: 2.6.3)
Provided you want to clean raw files from scratch yourself:
Uncleaned scraped files (81 in total) are located in \Housing_data_raw \ Housing_data_batch1 and 2. Saved in .csv format and compressed as 7-zip files.
Additional files added/appended to the Cleaned files are located in \Addtional_data and named DK_inflation_rates, DK_interest_rates, DK_morgage_rates and DK_regions_zip_codes. Saved in .xlsx format.
Content
Each row in the dataset contains a residential household sale during the period 1992 - 2024.
“Cleaned files” columns:
0 'date': is the transaction date
1 'quarter': is the quarter based on a standard calendar year
2 'house_id': unique house id (could be dropped)
3 'house_type': can be 'Villa', 'Farm', 'Summerhouse', 'Apartment', 'Townhouse'
4 'sales_type': can be 'regular_sale', 'family_sale', 'other_sale', 'auction', '-' (“-“ could be dropped)
5 'year_build': range 1000 to 2024 (could be narrowed more)
6 'purchase_price': is purchase price in DKK
7 '%_change_between_offer_and_purchase': could differ negatively, be zero or positive
8 'no_rooms': number of rooms
9 'sqm': number of square meters
10 'sqm_price': 'purchase_price' divided by 'sqm'
11 'address': is the address
12 'zip_code': is the zip code
13 'city': is the city
14 'area': 'East & mid jutland', 'North jutland', 'Other islands', 'Capital, Copenhagen', 'South jutland', 'North Zealand', 'Fyn & islands', 'Bornholm'
15 'region': 'Jutland', 'Zealand', 'Fyn & islands', 'Bornholm'
16 'nom_interest_rate%': Danish nominal interest rate shown per quarter; however, the actual rate is not converted from annualized to quarterly
17 'dk_ann_infl_rate%': Danish annual inflation rate shown per quarter; however, the actual rate is not converted from annualized to quarterly
18 'yield_on_mortgage_credit_bonds%': 30 year mortgage bond rate (without spread)
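A minimal sketch for loading the parquet file and inspecting a few of these columns (the filename is illustrative; pyarrow or fastparquet is required by pandas for parquet support):

import pandas as pd

sales = pd.read_parquet("DKHousingprices_1.parquet")  # filename/extension may differ on Kaggle
sales["date"] = pd.to_datetime(sales["date"])
# Median square-meter price per year for regular sales
regular = sales[sales["sales_type"] == "regular_sale"].copy()
print(regular.groupby(regular["date"].dt.year)["sqm_price"].median())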
Uses
Various (statistical) analysis, visualisation and I assume machine learning as well.
Practice exercises etc.
Uncleaned scraped files are great for practicing cleaning, especially string cleaning. I'm not an expert, as seen in the coding ;-).
Disclaimer
The data and information in the dataset provided here are intended to be used primarily for educational purposes only. I do not own any data, and all rights are reserved to the respective owners as outlined in "Acknowledgements/sources". The accuracy of the dataset is not guaranteed; accordingly, any analysis and/or conclusions are solely the user's own responsibility and accountability.
Acknowledgements/sources
All data is publicly available on:
Boliga: https://www.boliga.dk/
Finans Danmark: https://finansdanmark.dk/
Danmarks Statistik: https://www.dst.dk/da
Statistikbanken: https://statistikbanken.dk/statbank5a/default.asp?w=2560
Macrotrends: https://www.macrotrends.net/
PostNord: https://www.postnord.dk/
World Data: https://www.worlddata.info/
Dataset picture / cover photo: Nick Karvounis (https://unsplash.com/)
Have fun… :-)