15 datasets found
  1. Excel projects

    • kaggle.com
    zip
    Updated Jul 23, 2024
    Cite
    BTaffetani (2024). Excel projects [Dataset]. https://www.kaggle.com/datasets/btaffetani/excel-projects
    Available download formats: zip (189455 bytes)
    Dataset updated
    Jul 23, 2024
    Authors
    BTaffetani
    Description

    This is a collection of statistical projects in which I used Microsoft Excel. The definition of each project was given by ProfessionAI, while the statistical analysis was done by me. More specifically:

    • customer_complaints_assignment is an Introduction to Data Analytics example where, given a dataset of complaints from customers of financial companies, filtering, counting and basic analytics tasks were carried out;

    • trades_on_exchanges is an Advanced Data Analytics project where statistical analysis of trading operations was done;

    • progetto_finale_inferenza is a Statistical Inference project where inference analysis was performed on a toy dataset about the population of a city.

  2. Ministry of Justice Synthetic Data First Cross-Justice System Linking...

    • datacatalogue.ukdataservice.ac.uk
    Updated Jun 18, 2025
    + more versions
    Cite
    Ministry of Justice (2025). Ministry of Justice Synthetic Data First Cross-Justice System Linking Dataset, England and Wales, 2011-2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-9394-1
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Ministry of Justice
    Area covered
    Wales
    Description
    The Ministry of Justice (MoJ) Data First Synthetic Data Project aims to improve engagement with Data First datasets by making synthetic versions of content available, enabling more rapid development of research proposals and thereby enhancing the potential for linked administrative data to improve understanding and outcomes across justice systems. The project has developed two components: a dataset generation platform and an initial release of low-fidelity synthetic data tables.

    This study includes a synthetically-generated version of the Ministry of Justice Data First cross-justice system linking dataset. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used and joined in the same way as the real datasets. As well as underpinning training, synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.

    The cross-justice system linking dataset allows users to join up information from data sources across the justice system (courts, prisons, probation) and should be used in conjunction with other datasets shared as part of the Data First Programme.

    Records relating to individual justice system users can be linked using unique identifiers provided for the people involved. This connects people involved in different parts of the criminal justice system or who have interacted with the civil or family courts. It allows longitudinal analysis and investigation of repeat appearances and interactions with multiple justice services, which will increase understanding of users, their pathways and outcomes.

    This dataset does not itself contain information about people or their interactions with the justice system, but acts as a lookup to identify where records in other datasets are believed to relate to the same person, using our probabilistic record linkage package, Splink.

    The person link table contains rows with references to all records in the individual datasets that have been linked to date plus new identifiers, generated in the linking process, which enables these records to be grouped and linked across the datasets.

    Datasets currently linkable using this dataset are:

    • Ministry of Justice Data First magistrates’ court defendant - England and Wales
    • Ministry of Justice Data First Crown Court defendant - England and Wales
    • Ministry of Justice Data First prisoner custodial journey - England and Wales
    • Ministry of Justice Data First probation - England and Wales
    • Ministry of Justice Data First family court - England and Wales
    • Ministry of Justice Data First civil court - England and Wales

    It is expected that this table will be extended to include more datasets in future.

    The case link table links cases between the criminal courts only (for example identifying cases that began in the magistrates' court and have been committed to the Crown Court for trial or sentence, or on appeal). This allows users to follow cases from start to finish and prevent double counting.
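The person-link lookup pattern described above can be sketched in a few lines of Python. Everything here is illustrative: the table, column and dataset names are hypothetical toys, not the MoJ schema; the real linkage is produced with Splink.

```python
from collections import defaultdict

# Person link table: maps each source record to a generated person ID.
# (Hypothetical source/column names, not the MoJ schema.)
person_links = [
    {"source": "mags_court", "record_id": "M1", "person_id": "P001"},
    {"source": "crown_court", "record_id": "C7", "person_id": "P001"},
    {"source": "prison", "record_id": "X3", "person_id": "P002"},
]

# Index from (source, record_id) -> linked person ID.
link_index = {(r["source"], r["record_id"]): r["person_id"] for r in person_links}

# Records in the individual datasets carry no person ID themselves;
# the link table is the only way to group them by person.
mags_records = [{"record_id": "M1", "offence": "theft"}]
crown_records = [{"record_id": "C7", "outcome": "convicted"}]

by_person = defaultdict(list)
for rec in mags_records:
    by_person[link_index[("mags_court", rec["record_id"])]].append(rec)
for rec in crown_records:
    by_person[link_index[("crown_court", rec["record_id"])]].append(rec)

print(by_person["P001"])  # both court appearances for the same person
```

The same join logic applies whichever of the linkable datasets above you pull records from: look up the generated person ID, then group.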

  3. Flight Delay Statistics Project 2024

    • kaggle.com
    zip
    Updated Nov 5, 2025
    + more versions
    Cite
    Cindy Zhao (2025). Flight Delay Statistics Project 2024 [Dataset]. https://www.kaggle.com/datasets/cindyxingzhao/flight-delay-statistics-project-2024
    Available download formats: zip (162728114 bytes)
    Dataset updated
    Nov 5, 2025
    Authors
    Cindy Zhao
    Description

    BACKGROUND The data contained in the compressed file has been extracted from the Marketing Carrier On-Time Performance (Beginning January 2018) data table of the "On-Time" database from the TranStats data library. The time period is indicated in the name of the compressed file; for example, XXX_XXXXX_2001_1 contains data of the first month of the year 2001.

    RECORD LAYOUT Below are the fields in the order they appear on the records:

    • Year: Year
    • Quarter: Quarter (1-4)
    • Month: Month
    • DayofMonth: Day of Month
    • DayOfWeek: Day of Week
    • FlightDate: Flight Date (yyyymmdd)
    • Marketing_Airline_Network: Unique Marketing Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users (for example, PA, PA(1), PA(2)). Use this field for analysis across a range of years.
    • Operated_or_Branded_Code_Share_Partners: Reporting Carrier Operated or Branded Code Share Partners
    • DOT_ID_Marketing_Airline: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
    • IATA_Code_Marketing_Airline: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
    • Flight_Number_Marketing_Airline: Flight Number
    • Originally_Scheduled_Code_Share_Airline: Unique Scheduled Operating Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users (for example, PA, PA(1), PA(2)). Use this field for analysis across a range of years.
    • DOT_ID_Originally_Scheduled_Code_Share_Airline: US DOT airline identification number (see DOT_ID_Marketing_Airline).
    • IATA_Code_Originally_Scheduled_Code_Share_Airline: IATA carrier code (see IATA_Code_Marketing_Airline).
    • Flight_Num_Originally_Scheduled_Code_Share_Airline: Flight Number
    • Operating_Airline: Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users (for example, PA, PA(1), PA(2)). Use this field for analysis across a range of years.
    • DOT_ID_Operating_Airline: US DOT airline identification number (see DOT_ID_Marketing_Airline).
    • IATA_Code_Operating_Airline: IATA carrier code (see IATA_Code_Marketing_Airline).
    • Tail_Number: Tail Number
    • Flight_Number_Operating_Airline: Flight Number
    • OriginAirportID: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
    • OriginAirportSeqID: Origin Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
    • OriginCityMarketID: Origin Airport, City Market ID. An identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
    • Origin: Origin Airport
    • OriginCityName: Origin Airport, City Name
    • OriginState: Origin Airport, State Code
    • OriginStateFips: Origin Airport, State Fips
    • OriginStateName: Origin Airport, State Name
    • OriginWac: Origin Airport, World Area Code
    • DestAirportID: Destination Airport, Airport ID (see OriginAirportID).
    • DestAirportSeqID: Destination Airport, Airport Sequence ID (see OriginAirportSeqID).
    • DestCityMarketID: Destination Airport, City Market ID (see OriginCityMarketID).
    • Dest: Destination Airport
    • DestCityName: Destination Airport, City Name
    • DestState: Destination Airport, State Code
    • DestStateFips: D...
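A minimal sketch of reading this record layout with the Python standard library, using the FlightDate format (yyyymmdd) and the numeric-suffix carrier-code convention described above. The two sample rows are invented for illustration, not taken from the dataset.

```python
import csv
import io
from datetime import datetime

# Tiny illustrative slice of the layout; real extracts carry every field.
sample = io.StringIO(
    "FlightDate,Operating_Airline,Flight_Number_Operating_Airline,Origin,Dest\n"
    "20240115,AA,100,JFK,LAX\n"
    "20240115,PA(1),207,ORD,DFW\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # FlightDate is documented as yyyymmdd.
    row["FlightDate"] = datetime.strptime(row["FlightDate"], "%Y%m%d").date()

# The suffix convention (PA, PA(1), PA(2)) keeps reused codes distinct,
# so grouping by Operating_Airline is safe across years.
carriers = {row["Operating_Airline"] for row in rows}
print(sorted(carriers))  # ['AA', 'PA(1)']
```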

  4. Project for Statistics on Living Standards and Development 1993, Merged -...

    • datafirst.uct.ac.za
    Updated Jul 20, 2020
    + more versions
    Cite
    Southern Africa Labour and Development Research Unit (2020). Project for Statistics on Living Standards and Development 1993, Merged - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/820
    Dataset updated
    Jul 20, 2020
    Dataset authored and provided by
    Southern Africa Labour and Development Research Unit
    Time period covered
    1993 - 1994
    Area covered
    South Africa
    Description

    Abstract

    The Project for Statistics on Living Standards and Development was a countrywide World Bank-sponsored Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect data on the conditions under which South Africans live in order to provide policymakers with the data necessary for development planning. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.

    Geographic coverage

    The survey had national coverage.

    Analysis unit

    Households and individuals

    Universe

    The survey covered all household members. Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn for the households in ESDs.

    Kind of data

    Sample survey data

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demographics, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.

    In addition to the detailed household questionnaire, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.

    A literacy assessment module (LAM) was administered to two respondents in each household (a household member aged 13-18 and one aged between 18 and 50) to assess literacy levels.

    Data appraisal

    The data collected in clusters 217 and 218 are highly unreliable and have therefore been removed from the dataset currently available on the portal. Researchers who have downloaded the data in the past should download version 2.0 of the dataset to ensure they have the corrected data. Version 2.0 of the dataset excludes two clusters from both the 1993 and 1998 samples. During follow-up field research for the KwaZulu-Natal Income Dynamics Study (KIDS) in May 2001 it was discovered that all 39 household interviews in clusters 217 and 218 had been fabricated in both 1993 and 1998. These households have been dropped in the updated release of the data. In addition, cluster 206 is now coded as urban as this was incorrectly coded as rural in the first release of the data. Note: Weights calculated by the World Bank and provided with the original data are NOT updated to reflect these changes.

  5. Steam Video Game and Bundle Data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Steam Video Game and Bundle Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

    Metadata includes

    • reviews

    • purchases, plays, recommends (likes)

    • product bundles

    • pricing information

    Basic Statistics:

    • Reviews: 7,793,069

    • Users: 2,567,538

    • Items: 15,474

    • Bundles: 615
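These dumps are line-delimited records; some research releases of this kind are strict JSON, while others use Python-literal dicts, so a tolerant parser helps. A small sketch under that assumption (the two sample lines and field names are invented for illustration):

```python
import ast
import json

def parse_line(line):
    """Parse one record, accepting strict JSON or a Python-literal dict."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return ast.literal_eval(line)

lines = [
    '{"user_id": "u1", "item_id": "10", "recommend": true}',   # strict JSON
    "{'user_id': 'u2', 'item_id': '10', 'recommend': False}",  # Python literal
]
records = [parse_line(l) for l in lines]
print(records[1]["recommend"])  # False
```

In practice you would iterate over the (possibly gzipped) file line by line instead of a hard-coded list.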

  6. Behance Community Art Data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Behance Community Art Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.

    Metadata includes

    • appreciates (likes)

    • timestamps

    • extracted image features

    Basic Statistics:

    • Users: 63,497

    • Items: 178,788

    • Appreciates (likes): 1,000,000

  7. Amazon Question and Answer Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Amazon Question and Answer Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain 1.48 million question and answer pairs about products from Amazon.

    Metadata includes

    • question and answer text

    • whether the question is binary (yes/no) and, if so, whether it has a yes/no answer

    • timestamps

    • product ID (to reference the review dataset)

    Basic Statistics:

    • Questions: 1.48 million

    • Answers: 4,019,744

    • Labeled yes/no questions: 309,419

    • Number of unique products with questions: 191,185

  8. UFC Complete Dataset (All events 1996-2024)

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    MaksBasher (2024). UFC Complete Dataset (All events 1996-2024) [Dataset]. https://www.kaggle.com/datasets/maksbasher/ufc-complete-dataset-all-events-1996-2024
    Available download formats: zip (2149419 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    MaksBasher
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my first public project; upvotes and suggestions are appreciated 😎🖤

    Project description

    The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company, considered the biggest promotion in the MMA world. Soon they will host an anniversary event, UFC 300. It is interesting to see the path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. So I decided to gather a new dataset that includes most of the useful stats for various data analysis tasks, and to put my theoretical skills into practice. I created a Python script to parse the ufcstats website and gather the available data.

    Currently 4 datasets are available

    Large dataset

    The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:

    • ML model for betting predictions
    • Data analysis to compare different years, weight classes, fighters, etc.
    • In-depth analysis of a specific fight or of all fights of a selected fighter
    • Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Medium dataset

    Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it for getting a basic understanding of UFC historical data and perform different visualisations.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Small dataset

    Contains data about completed or upcoming events, with only 683 rows and 3 columns.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Fighter stats

    A dataset with the stats for every fighter who has fought at a UFC event.

  9. Energy Consumption of United States Over Time

    • kaggle.com
    zip
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Energy Consumption of United States Over Time [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-energy-consumption-of-united-state
    Available download formats: zip (222388 bytes)
    Dataset updated
    Dec 14, 2022
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Energy Consumption of United States Over Time

    Building Energy Data Book

    By Department of Energy [source]

    About this dataset

    The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings, from single-family homes to large office complexes, as well as its impact on the environment. The Building Technologies Office (BTO) within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policymakers, engineers and everyday observers interested in learning more about our built environment and its energy usage patterns.


    How to use the dataset

    This dataset provides comprehensive information on energy consumption in the buildings sector of the United States. It contains a number of key variables that can be used to analyze and explore the relationships between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV and tabular formats, which makes it convenient for those who prefer programs like Excel or other statistical modeling software.

    To get started with this dataset, we've developed a guide outlining how to use it effectively for your research or project needs.

    • Understand what's included: Before you start analyzing the data, read through the provided documentation so that you fully understand what each dataset contains. Be aware of any limitations or requirements associated with each type of data point so that the conclusions you draw are valid and reliable.

    • Clean up any outliers: Take some time upfront to investigate suspicious outliers before running further analyses; left unchecked, they can skew results and complicate statistical modeling by artificially inflating values. Account for missing values as well: they are not always obvious at first glance in a table or chart, but accurate statistics depend on handling them.

    • Exploratory data analysis: After cleaning the dataset, explore it by visualizing summaries such as boxplots, histograms and scatter plots. This gives an initial sense of the trends present across regions and variables, which can inform later predictive models, and it highlights any discontinuous changes over time so that predictors contribute meaningful signal rather than noise.

    • Analyze key metrics & observations: Once the exploratory analysis is done, move on to post-processing steps such as computing correlations among explanatory variables, fitting and significance-testing regression models, and imputing missing or outlier values, depending on the specific needs of the project, before interpreting the results.
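The outlier screen in step 2 can be sketched with the standard library. This is a minimal example using Tukey's 1.5×IQR fence on an invented column of consumption values; it is one common rule of thumb, not the dataset's prescribed method.

```python
import statistics

# Illustrative energy-consumption readings; one suspicious spike.
values = [10.2, 11.0, 9.8, 10.5, 10.9, 55.0, 10.1, 9.9]

q1, _q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences

outliers = [v for v in values if v < lo or v > hi]
clean = [v for v in values if lo <= v <= hi]
print(outliers)  # the 55.0 reading is flagged
```

After screening, the `clean` series is what feeds the boxplots, histograms and regressions described in the later steps.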

    Research Ideas

    • Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
    • Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
    • Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
  10. Electronic Health Legal Data

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). Electronic Health Legal Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/electronic-health-legal-data
    Available download formats: zip (192951 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    The Devastator
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Electronic Health Legal Data

    Exploring Laws and Regulations

    By US Open Data Portal, data.gov [source]

    About this dataset

    This Electronic Health Information Legal Epidemiology dataset offers an extensive collection of legal and epidemiological data that can be used to understand the complexities of electronic health information. It contains a detailed balance of variables, including legal requirements, enforcement mechanisms, proprietary tools, access restrictions, privacy and security implications, data rights and responsibilities, user accounts and authentication systems. This powerful set provides researchers with real-world insights into the functioning of EHI law in order to assess its impact on patient safety and public health outcomes. With such data it is possible to gain a better understanding of current policies regarding the regulation of electronic health information as well as their potential for improvement in safeguarding patient confidentiality. Use this dataset to explore how these laws impact our healthcare system by exploring patterns across different groups over time or analyze changes leading up to new versions or updates. Make exciting discoveries with this comprehensive dataset!


    How to use the dataset

    • Start by familiarizing yourself with the different columns of the dataset. Examine each column closely and look up any unfamiliar terminology to get a better understanding of what the columns are referencing.

    • Once you understand the data and what it is intended to represent, think about how you might use it in your analysis. You may want to create a research question, or a narrower focus for your project, around the legal epidemiology of electronic health information that can be answered with this data set.

    • After creating your research plan, clean up and reshape the data as needed to prepare it for the analysis or visualization specified in your project plan or research design.

    • Next, perform exploratory data analysis (EDA) on relevant subsets of the data, for example for specific countries or target groups (e.g. by gender). Filter out irrelevant information, analyze the patterns and trends observed in the filtered data, and compare areas with differing rates of e-health rules and regulations, considering how decisions by elected officials are shaped by demographics, socioeconomic factors, ideology and so on. Look for correlations, validate findings against reference datasets or multiple external sources where available, and keep sensitive data private and visible only to duly authorized users.

    • Finally, write concrete summaries of your discoveries and share the findings, preferably as infographics that showcase the evidence and your main conclusions, so that the broader community and interested professionals can benefit from, adapt and build on the results.

    Research Ideas

    • Studying how technology affects public health policies and practice - Using the data, researchers can look at the various types of legal regulations related to electronic health information to examine any relations between technology and public health decisions in certain areas or regions.
    • Evaluating trends in legal epidemiology – With this data, policymakers can identify patterns that help measure the evolution of electronic health information regulations over time and investigate why such rules are changing within different states or countries.
    • Analysing possible impacts on healthcare costs – Looking at changes in laws, regulations, and standards relate...
  11. 5.7M+ Records -Most Comprehensive Football Dataset

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    salimt (2025). 5.7M+ Records -Most Comprehensive Football Dataset [Dataset]. https://www.kaggle.com/datasets/xfkzujqjvx97n/football-datasets
    Explore at:
    zip(85313220 bytes)Available download formats
    Dataset updated
    Sep 15, 2025
    Authors
    salimt
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About Dataset – TL;DR

    Comprehensive football (soccer) data lake from Transfermarkt, clean and structured for analysis and machine learning.

    • 93,000+ players worldwide
    • 2,200+ clubs across all major leagues
    • 5.7M+ total records across 10 categories
    • 902,000+ market valuations
    • 1.9M+ player performance stats
    • 1.2M+ player transfer histories
    • 144,000+ injuries & 93,000+ national team appearances
    • 1.3M+ teammate relationships

    Everything in raw CSV format – perfect for EDA, ML, and advanced football analytics.

    The Most Comprehensive Transfermarkt Football Dataset

    A complete football data lake covering players, teams, transfers, performances, market values, injuries, and national team stats. Perfect for analysts, data scientists, researchers, and enthusiasts.

    🗺 Entity-Relationship Overview

    Here’s the high-level schema to help you understand the dataset structure:

    ER diagram: https://i.imgur.com/WXLIx3L.png (Transfermarkt Dataset ER Diagram)

    📊 Key Coverage

    • Players: 93,000+ professional players
    • Teams: 2,200+ clubs, 7,700+ club relationships
    • Data Volume: 5.7M+ total records
    • Global Scope: Major leagues and competitions worldwide

    🗂 Data Structure

    Organized into 10 well-structured CSV categories:

    Player Data (7 categories)

    • Player Profiles
    • Performances (matches, goals, assists, cards, minutes)
    • Market Values (historical valuations)
    • Transfer Histories
    • Injury Records
    • National Team Performances
    • Teammate Networks

    Team Data (3 categories)

    • Team Details (club info)
    • Competitions & Seasons
    • Parent/Child Team Relations

    🔗 What’s Inside?

    • 902K+ market value records to track valuation trends
    • 1.1M+ transfer histories with fees and movement
    • 1.9M+ performance stats across seasons and competitions
    • 144K+ injury records with days and matches missed
    • 93K+ national team appearances
    • 1.3M+ teammate relationships for chemistry analysis

    💡 Why This Dataset?

    Most football datasets are pre-processed and restrictive. This one is raw, rich, and flexible:

    • Build custom KPIs and models
    • Perform deep exploratory analysis (EDA)
    • Train machine learning and prediction pipelines
    • Combine with other football data sources

    🚀 Example Use Cases

    • Predictive Modeling – Player ratings, transfer value forecasts, injury risk
    • Data Visualization & Dashboards – Club comparisons, performance analytics
    • Scouting & Recruitment – Discover undervalued talent
    • Network Analysis – Teammate relationships and synergy

    🖥 Technical Details

    • Format: CSV files, UTF-8 encoded
    • Easy to Use: Ready for Python (pandas, numpy), R, SQL, BI tools
    • Scalable: 5.7M+ rows for big-data analysis
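As a sketch of how the CSV categories might be combined in pandas, the snippet below joins a miniature player-profile table with each player's latest market valuation. The frame and column names (`player_id`, `value_eur`, etc.) are illustrative assumptions, not the dataset's actual schema; in practice you would `pd.read_csv` the real files and substitute their headers.

```python
import pandas as pd

# Miniature stand-ins for two of the CSV categories; all column names
# here are assumptions, to be replaced after inspecting the real files.
players = pd.DataFrame({
    "player_id": [1, 2],
    "name": ["Player A", "Player B"],
})
market_values = pd.DataFrame({
    "player_id": [1, 1, 2],
    "date": ["2023-01-01", "2024-01-01", "2024-01-01"],
    "value_eur": [5_000_000, 8_000_000, 2_000_000],
})

# Keep each player's most recent valuation, then join it onto the profiles.
latest = (market_values.sort_values("date")
          .groupby("player_id", as_index=False)
          .last())
profiles_with_value = players.merge(latest, on="player_id", how="left")
```

The same join pattern extends to the transfer, injury, and performance tables once the shared key column is identified.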

    💡 Working on a Cool Project?

    I’m always excited to collaborate on innovative football data projects. If you’ve got an idea, let’s make it happen together!

    📬 Contact Me

    • GitHub: @salimt
    • Issues: Feel free to use GitHub Issues if you’ve got dataset-specific questions.

    Support & Visibility

    If this dataset helps you:
    - Upvote on Kaggle
    - Star the GitHub repo
    - Share with others in the football analytics community

    Tags

    football analytics soccer dataset transfermarkt sports analytics machine learning football research player statistics

    🔥 Analyze football like never before. Your next AI or analytics project starts here.

  12. NFL_LEAGUE_DATA

    • kaggle.com
    zip
    Updated Jul 16, 2022
    Cite
    Chance V (2022). NFL_LEAGUE_DATA [Dataset]. https://www.kaggle.com/datasets/chancev/nfl-league-data
    Explore at:
    zip(46482 bytes)Available download formats
    Dataset updated
    Jul 16, 2022
    Authors
    Chance V
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was scraped from tables found at https://www.nfl.com/standings/league/2021/REG (for each year). This dataset contains each individual year's LEAGUE data by team. There is also a master file which has compiled all the data into one csv for analysis and comparison by year. The column description can be found at the bottom of this section. If you are interested in the code used to scrape the data, you can view the full project details at https://github.com/cvandergeugten/NFL-LEAGUE-DATA/blob/main/nfl_league_data_scraper.py

    Challenge:

    This dataset replicates the table found on the NFL's website exactly. There are some columns that can be cleaned up, renamed, or altered to allow use for analysis. There are also columns that can be used to create new features to be used in analysis. For those that want some practice on tidying up datasets and using them for predictive modeling or exploratory analysis, here is a list of objectives you can try to accomplish with this data:

    1. Change names of PCT columns to reflect which stats they are calculating the percentage for.

    2. Ideas for feature engineering (creating new features):

    • Extract information from the 'record' columns (Home, Road, Division). These columns are not formatted to be directly usable for analysis, so you can create new columns that capture each statistic individually. For example, you can create a new column called "Home Wins" and then write some code to extract the number of wins from the 'Home' column. Repeat with 'Home Losses' and 'Home Ties'. If you do this for each record column, you will have transformed all that information into usable data for modeling and analysis.

    • Create a feature called 'Undefeated' which will be a binary categorical variable. Input a 1 if the team never lost a game in that particular record column, and put a 0 if that team had any losses within that record. Repeat for all the different record columns (you might want to specify the record in the variable like this: 'Undefeated Home')

    • Create new columns for the winning and losing streak values. You can name two columns 'Win Streak #' and 'Lose Streak #' and then write some code that extracts that information from the 'Strk' column. If a team was on a winning streak, the value of its 'Lose Streak #' should be 0.

    • Create new columns that indicate which division a team is in!

    • Have some fun and engineer some of your own features!!

    3. Use the data to answer these questions:

    • Over the last 21 years, who has been the best/worst performing teams?
    • Which teams perform better at home and which teams perform better on the road?
    • Which teams tie the most?
    • Pick your favorite team! What were the best years for this team in terms of performance? Did they ever go undefeated?
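The record and streak transformations in objective 2 can be sketched in plain Python. Note that the 'W-L-T' record format (e.g. '10-6-1') and the 'W3'/'L2' streak format are assumptions inferred from the column descriptions below; verify them against the actual CSV before use.

```python
def parse_record(record):
    """Split an assumed 'W-L-T' record string (e.g. '10-6-1') into ints."""
    wins, losses, ties = (int(part) for part in record.split("-"))
    return wins, losses, ties

def is_undefeated(record):
    """Binary flag: 1 if the record contains no losses, else 0."""
    _, losses, _ = parse_record(record)
    return 1 if losses == 0 else 0

def parse_streak(strk):
    """Split an assumed 'W3' / 'L2' style 'Strk' value into
    (win_streak, lose_streak); the inactive streak is 0."""
    kind, length = strk[0], int(strk[1:])
    return (length, 0) if kind.upper() == "W" else (0, length)
```

Applying these across the Home, Road, Div, Conf, and Non-conf columns yields the modeling-ready features the challenge asks for.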

    Column Info:

    • NFL Team: Team name (includes the name of the home city)
    • W: Total number of wins
    • L: Total number of losses
    • T: Number of ties
    • PCT: Win percentage
    • PF: Total points scored for the team
    • PA: Total points scored against the team
    • Net Pts: Net points
    • Home: Home record
    • Road: Road record
    • Div: Division record
    • Pct: Win percentage (for division record)
    • Conf: Conference record
    • Pct.1: Win percentage (for conference record)
    • Non-conf: Non-Conference record
    • Strk: Win or Loss streak
    • Last 5: Record from last 5 games played
    • Year: Year of the stats
  13. Real Madrid UEFA Champions League Perform Analysis

    • kaggle.com
    zip
    Updated Aug 26, 2023
    Cite
    Joaco Romero Flores (2023). Real Madrid UEFA Champions League Perform Analysis [Dataset]. https://www.kaggle.com/datasets/joaquinaromerof/real-madrid-analysis
    Explore at:
    zip(32668239 bytes)Available download formats
    Dataset updated
    Aug 26, 2023
    Authors
    Joaco Romero Flores
    License

    https://cdla.io/permissive-1-0/

    Description

    Introduction

    In the high-stakes world of professional football, public opinion often forms around emotions, loyalties, and subjective interpretations. The project at hand aims to transcend these biases by delving into a robust, data-driven analysis of Real Madrid's performance in the UEFA Champions League over the past decade.

    Through a blend of traditional statistical methods, machine learning models, game theory, psychology, philosophy, and even military strategies, this investigation presents a multifaceted view of what contributes to a football team's success and how performance can be objectively evaluated.

    Exploratory Data Analysis (EDA)

    The EDA consists of two layers:

    1. Statistical Analysis:

    • Set-Up Process: Loading libraries, data frames, determining position relevancy, and calculating average minutes played.
    • Kurtosis: Understanding data variance and its internal behavior.
    • Feature Engineering: Preprocessing with standard scaler for later ML applications.
    • Sample Statistics, Distribution, and Standard Errors: Essential for inference.
    • Central Limit Theorem: A focus for understanding by experienced data scientists.
    • A/B Testing & ANOVA: Used for null hypothesis testing.

    2. Machine Learning Models:

    • Ordinary Least Square: To estimate the unknown parameters.
    • Linear Regression Models with Sci-Kit Learn: Predicting the dependent variable.
    • XGBoost & Cross-Validation: A powerful algorithm for making predictions.
    • Conformal Prediction: To create valid prediction regions.
    • Radar Maps: For visualizing player performance during their match campaigns.
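To make the first bullet concrete, here is a minimal numpy sketch of Ordinary Least Squares on made-up numbers (not the project's actual Real Madrid data): goals regressed on minutes played and shots, with the intercept handled explicitly.

```python
import numpy as np

# Toy design matrix: minutes played and shots per match (made-up values).
X = np.array([[90.0, 3.0], [45.0, 1.0], [88.0, 4.0], [60.0, 2.0]])
# Goals generated exactly as 0.5 + 0.01*minutes + 0.2*shots, so OLS
# should recover those coefficients.
y = np.array([2.0, 1.15, 2.18, 1.5])

# Prepend an intercept column and solve min ||X1 @ beta - y||^2.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
predictions = X1 @ beta
```

The scikit-learn `LinearRegression` model mentioned above fits the same parameters; the closed-form view simply makes the estimation step explicit.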

    Objectives

    The goal of this analysis is multifaceted:

    1. Unveil Hidden Statistics: To reveal the underlying patterns often overlooked in casual discussions.
    2. Demonstrate the Impact of Probability: How it shapes matches and seasons.
    3. Explore Interdisciplinary Influences: Including Game Theory, Strategy, Cooperation, Psychology, Physiology, Military Training, Luck, Economics, Philosophy, and even Freudian Analysis.
    4. Challenge Subjective Bias: By presenting a well-rounded, evidence-based view of football performance.

    Conclusion

    This project stands as a testament to the profound complexity of football performance and the nuanced insights that can be derived through rigorous scientific analysis. Whether a data scientist recruiter, football fanatic, or curious mind, the findings herein offer a unique perspective that bridges the gap between passion and empiricism.

  14. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    zip(4293465577 bytes)Available download formats
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundationhttp://www.wikimedia.org/
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia, output as JSON files (compressed in tar.gz).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy Included Fields:

    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
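Given the archive sizes below, it can be practical to stream records rather than unpack 21 GB to disk. The sketch below assumes the archive's .json members contain newline-delimited JSON objects; the internal layout isn't documented here, so treat this as a starting point and adapt it to the actual files.

```python
import json
import tarfile

def iter_people(tar_path):
    """Stream parsed records out of wme_people_infobox.tar.gz without
    extracting the whole archive to disk. Assumes newline-delimited
    JSON in each .json member; adapt if members hold single documents."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            with tar.extractfile(member) as fh:
                for line in fh:
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the generator yields one record at a time, downstream filtering (e.g. keeping only articles with a particular infobox field) can run in constant memory.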

    Stats

    Infoboxes only:
    • Size of compressed file: 2 GB
    • Size of uncompressed file: 11 GB

    Infoboxes + sections + short description:
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    End stats:
    • Total number of people articles in this dataset: 1,559,985
    • ...that have a short description: 1,416,701
    • ...that have an infobox: 1,559,985
    • ...that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  15. All Time Premier League Player Statistics

    • kaggle.com
    zip
    Updated Sep 24, 2020
    Cite
    Rishikesh Kanabar (2020). All Time Premier League Player Statistics [Dataset]. https://www.kaggle.com/rishikeshkanabar/premier-league-player-statistics-updated-daily
    Explore at:
    zip(35234 bytes)Available download formats
    Dataset updated
    Sep 24, 2020
    Authors
    Rishikesh Kanabar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    I am a really huge football fan and the Premier League is one of my favourite football (or soccer, whatever you like to call it) leagues. So, as my very first dataset, I thought this would be a great opportunity for me to make a dataset of player statistics of all seasons from the Premier League.

    The Premier League, often referred to as the English Premier League or the EPL outside England, is the top level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL).

    Home to some of the most famous clubs, players, managers and stadiums in world football, the Premier League is the most-watched league on the planet, with one billion homes watching the action in 188 countries. The league takes place between August and May and involves the teams playing each other home and away across the season, a total of 380 matches.

    Three points are awarded for a win, one point for a draw and none for a defeat, with the team with the most points at the end of the season winning the Premier League title. The teams that finish in the bottom three of the league table at the end of the campaign are relegated to the Championship, the second tier of English football. Those teams are replaced by three clubs promoted from the Championship; the sides that finish in first and second place and the third via the end-of-season playoffs.

    Details about the dataset

    • Some players, depending on their position, may not have certain statistics - for example, a goalkeeper may not have a statistic for "Shot Accuracy"
    • The filename format is: dataset - {yyyy-mm-dd} (the date on which the file was last updated)

    Content

    The data was acquired from:

    https://www.premierleague.com/

    I made a BeautifulSoup4 web scraper in Python3 which automatically outputs a CSV file of all the player statistics. The runtime is about 20 minutes, but it varies with the bandwidth of the Internet connection. I made this program so that the dataset could be updated weekly: the statistics change after each match a player plays, so for the most up-to-date results such a program is needed. Planning this project took 2 days, making the program in Python3 took 7 days, and the testing and bug fixing took another 5 days, so the project was completed in the span of 2 weeks.
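As an illustrative sketch of the table-scraping step (not the author's actual code), BeautifulSoup can turn an HTML stats table into row dictionaries. Request handling and the real premierleague.com page structure are omitted; the markup handled here is deliberately simple.

```python
from bs4 import BeautifulSoup

def parse_stats_table(html):
    """Parse the first HTML table into a list of {header: cell} dicts.
    Real pages need more careful element selection than this."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which has <th> cells only
            rows.append(dict(zip(headers, cells)))
    return rows
```

Writing the resulting dicts out with `csv.DictWriter` then yields the kind of CSV this dataset ships.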

    Acknowledgements

    Source credits : https://www.premierleague.com/ Image credits : https://rb.gy/wuiwth

    Inspiration

    How do variables like age, nationality and club affect the player performance?

    Known issues in the dataset

    • Goals per match displays an abnormally high value for a few players because the HTML shows an incorrect value during the first few milliseconds of loading the page. I am trying to fix it analytically rather than scraping directly from the website.