This is a collection of statistical projects where I used Microsoft Excel. The definition of each project was given by ProfessionAI, while the statistical analysis was done by me. More specifically:
- customer_complaints_assignment is an exercise for Introduction to Data Analytics where, given a dataset of customer complaints about financial companies, filtering, counting and basic analytics tasks were carried out;
- trades_on_exchanges is a project for Advanced Data Analytics where statistical analysis of trading operations was done;
- progetto_finale_inferenza is a project on Statistical Inference where inference analysis was carried out on a toy dataset about the population of a city.
This study includes a synthetically-generated version of the Ministry of Justice Data First cross-justice system linking dataset. Synthetic versions of all 43 tables in the MoJ Data First data ecosystem have been created. These versions can be used / joined in the same way as the real datasets. As well as underpinning training, synthetic datasets should enable researchers to explore research questions and to design research proposals prior to submitting these for approval. The code created during this exploration and design process should then enable initial results to be obtained as soon as data access is granted.
The cross-justice system linking datasets allows users to join up information from data sources across the justice system (courts, prisons, probation) and should be used in conjunction with other datasets shared as part of the Data First Programme.
Records relating to individual justice system users can be linked using unique identifiers provided for the people involved. This connects people involved in different parts of the criminal justice system or who have interacted with the civil or family courts. This allows for longitudinal analysis and investigation of repeat appearances and interactions with multiple justice services, which will increase understanding around users, their pathways and outcomes.
This dataset does not itself contain information about people or their interactions with the justice system, but acts as a lookup to identify where records in other datasets are believed to relate to the same person, using our probabilistic record linkage package, Splink.
The person link table contains rows with references to all records in the individual datasets that have been linked to date, plus new identifiers, generated in the linking process, which enable these records to be grouped and linked across the datasets.
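As a rough illustration of how the lookup might be used, the sketch below joins a source dataset to the person link table with pandas and groups records by the linked person identifier. The file names and column names (person_id, source_table, source_record_id, record_id) are illustrative assumptions, not the actual MoJ Data First schema.

    # Illustrative only: join a source dataset to the person link table, then
    # group by the cross-system person identifier. Column and file names are
    # assumptions, not the actual MoJ Data First schema.
    import pandas as pd

    person_links = pd.read_csv("person_link_table.csv")    # lookup produced by Splink
    court_records = pd.read_csv("magistrates_courts.csv")  # one of the linkable datasets

    court_with_ids = court_records.merge(
        person_links[person_links["source_table"] == "magistrates_courts"],
        left_on="record_id",
        right_on="source_record_id",
        how="left",
    )

    # Count appearances per linked person to study repeat contact with the system.
    appearances_per_person = court_with_ids.groupby("person_id").size()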
Datasets currently linkable using this dataset are:
It is expected that this table will be extended to include more datasets in future.
The case link table links cases between the criminal courts only (for example identifying cases that began in the magistrates' court and have been committed to the Crown Court for trial or sentence, or on appeal). This allows users to follow cases from start to finish and prevent double counting.
BACKGROUND
The data contained in the compressed file has been extracted from the Marketing Carrier On-Time Performance (Beginning January 2018) data table of the "On-Time" database from the TranStats data library. The time period is indicated in the name of the compressed file; for example, XXX_XXXXX_2001_1 contains data of the first month of the year 2001.
RECORD LAYOUT
Below are fields in the order that they appear on the records:
- Year: Year
- Quarter: Quarter (1-4)
- Month: Month
- DayofMonth: Day of Month
- DayOfWeek: Day of Week
- FlightDate: Flight Date (yyyymmdd)
- Marketing_Airline_Network: Unique Marketing Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- Operated_or_Branded_Code_Share_Partners: Reporting Carrier Operated or Branded Code Share Partners
- DOT_ID_Marketing_Airline: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
- IATA_Code_Marketing_Airline: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
- Flight_Number_Marketing_Airline: Flight Number
- Originally_Scheduled_Code_Share_Airline: Unique Scheduled Operating Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- DOT_ID_Originally_Scheduled_Code_Share_Airline: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
- IATA_Code_Originally_Scheduled_Code_Share_Airline: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
- Flight_Num_Originally_Scheduled_Code_Share_Airline: Flight Number
- Operating_Airline: Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
- DOT_ID_Operating_Airline: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
- IATA_Code_Operating_Airline: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
- Tail_Number: Tail Number
- Flight_Number_Operating_Airline: Flight Number
- OriginAirportID: Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- OriginAirportSeqID: Origin Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
- OriginCityMarketID: Origin Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
- Origin: Origin Airport
- OriginCityName: Origin Airport, City Name
- OriginState: Origin Airport, State Code
- OriginStateFips: Origin Airport, State Fips
- OriginStateName: Origin Airport, State Name
- OriginWac: Origin Airport, World Area Code
- DestAirportID: Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
- DestAirportSeqID: Destination Airport, Airport Sequence ID. An identification number assigned by US DOT to identify a unique airport at a given point of time. Airport attributes, such as airport name or coordinates, may change over time.
- DestCityMarketID: Destination Airport, City Market ID. City Market ID is an identification number assigned by US DOT to identify a city market. Use this field to consolidate airports serving the same city market.
- Dest: Destination Airport
- DestCityName: Destination Airport, City Name
- DestState: Destination Airport, State Code
- DestStateFips: D...
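A minimal loading sketch in pandas follows, using field names from the record layout above; the file name is a placeholder and the column subset is only an example.

    # Minimal loading sketch; the file name is a placeholder and the column
    # subset is only an example drawn from the record layout above.
    import pandas as pd

    cols = [
        "FlightDate", "Marketing_Airline_Network", "Operating_Airline",
        "Flight_Number_Operating_Airline", "OriginAirportID", "DestAirportID",
    ]
    flights = pd.read_csv("on_time_marketing_carrier_2018_1.csv", usecols=cols)

    # Prefer the DOT airport IDs over 3-letter codes for multi-year analysis,
    # since airport codes can change and be reused.
    flights_per_origin = flights.groupby("OriginAirportID").size().sort_values(ascending=False)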
The Project for Statistics on Living standards and Development was a countrywide World Bank sponsored Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect data on the conditions under which South Africans live in order to provide policymakers with the data necessary for development planning. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.
The survey had national coverage
Households and individuals
The survey covered all household members. Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council, and within each of these hostels a representative sample was drawn, as was done for the households in the ESDs.
Sample survey data
Face-to-face [f2f]
The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demographics, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.
In addition to the detailed household questionnaire, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.
A literacy assessment module (LAM) was administered to two respondents in each household (one household member aged 13-18 and one aged 18-50) to assess literacy levels.
The data collected in clusters 217 and 218 are highly unreliable and have therefore been removed from the dataset currently available on the portal. Researchers who have downloaded the data in the past should download version 2.0 of the dataset to ensure they have the corrected data. Version 2.0 of the dataset excludes two clusters from both the 1993 and 1998 samples. During follow-up field research for the KwaZulu-Natal Income Dynamics Study (KIDS) in May 2001 it was discovered that all 39 household interviews in clusters 217 and 218 had been fabricated in both 1993 and 1998. These households have been dropped in the updated release of the data. In addition, cluster 206 is now coded as urban as this was incorrectly coded as rural in the first release of the data. Note: Weights calculated by the World Bank and provided with the original data are NOT updated to reflect these changes.
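For orientation, here is a hedged pandas sketch of the adjustments described above, assuming a household-level table with 'cluster' and 'urban' columns; both names are placeholders, not the actual variable names in the release.

    # Illustrative sketch of the version 2.0 adjustments; file and column
    # names are placeholders, not the actual variable names in the release.
    import pandas as pd

    households = pd.read_csv("saldru_1993_households.csv")

    # Drop the fabricated interviews from clusters 217 and 218.
    households = households[~households["cluster"].isin([217, 218])]

    # Recode cluster 206 as urban (incorrectly coded as rural in the first release).
    households.loc[households["cluster"] == 206, "urban"] = 1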
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Metadata includes
reviews
purchases, plays, recommends (likes)
product bundles
pricing information
Basic Statistics:
Reviews: 7,793,069
Users: 2,567,538
Items: 15,474
Bundles: 615
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.
Metadata includes
appreciates (likes)
timestamps
extracted image features
Basic Statistics:
Users: 63,497
Items: 178,788
Appreciates (likes): 1,000,000
These datasets contain 1.48 million question and answer pairs about products from Amazon.
Metadata includes
question and answer text
is the question binary (yes/no), and if so does it have a yes/no answer?
timestamps
product ID (to reference the review dataset)
Basic Statistics:
Questions: 1.48 million
Answers: 4,019,744
Labeled yes/no questions: 309,419
Number of unique products with questions: 191,185
https://creativecommons.org/publicdomain/zero/1.0/
This is my first public project; upvotes and suggestions are appreciated 😎🖤
The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company which is considered the biggest promotion in the MMA world. Soon they will host an anniversary event, UFC 300. It is interesting to see the path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. For that reason I've decided to gather a new dataset which includes most of the useful stats for various data analysis tasks, and to put my theoretical skills into practice. I've created a Python script to parse the ufcstats website and gather the available data.
Currently, 4 datasets are available.
The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:
- ML model for betting predictions;
- Data analysis to compare different years, weight classes, fighters, etc.;
- In-depth analysis of a specific fight or all fights of a selected fighter;
- Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc. (see the sketch below).
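As a rough illustration of the last idea above, the sketch below computes average stats per weight class with pandas; the file name and column names ('weight_class', 'sig_strikes_landed', 'takedowns') are assumptions, not confirmed field names from the dataset.

    # Rough illustration only; file and column names are assumptions.
    import pandas as pd

    fights = pd.read_csv("ufc_fights_large.csv")

    avg_by_class = (
        fights.groupby("weight_class")[["sig_strikes_landed", "takedowns"]]
        .mean()
        .sort_values("sig_strikes_landed", ascending=False)
    )
    print(avg_by_class)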
Source code for the scraper that was used to create this dataset can be found in this notebook
Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it for getting a basic understanding of UFC historical data and perform different visualisations.
Source code for the scraper that was used to create this dataset can be found in this notebook
Contains information about completed and upcoming events, with only 683 rows and 3 columns.
Source code for the scraper that was used to create this dataset can be found in this notebook
A dataset with the stats for every fighter who has fought at a UFC event.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
By Department of Energy [source]
The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single-family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policymakers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.
This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relations between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV format as well as tabular format which can make it helpful for those who prefer to use programs like Excel or other statistical modeling software.
In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.
Understand what's included: Before you start analyzing the data, you should read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point so that your results are valid and reliable when drawing conclusions from them.
Clean up any outliers: Take some time upfront to investigate suspicious outliers in your dataset before using it in further analyses; otherwise they can skew results down the road and make complex statistical modeling more difficult, since extreme values can distort model estimates. Missing values should also be accounted for, since they are not always obvious at first glance when reviewing a table or a graphical representation, yet they must be handled before accurate statistics can be obtained.
Exploratory data analysis: After cleaning up your dataset, do some basic exploring by visualizing different types of summaries such as boxplots, histograms and scatter plots. This will give you an initial view of the trends that exist across demographic, geographic and other groupings and variables, which can then inform future predictive models. This step will also highlight any clear discontinuous changes over time, helping to ensure that predictors contribute meaningful signal rather than noise.
Analyze key metrics & observations: Once exploratory analysis has been carried out, post-processing steps follow, such as computing correlations among explanatory variables, performing significance tests and regression models, and imputing missing or outlier values, depending on the specific needs of the project at hand.
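A minimal sketch of the outlier and missing-value handling described above, using a simple IQR rule and median imputation in pandas; the file and column names are placeholders, not the Data Book's actual layout.

    # Illustrative outlier flagging (IQR rule) and median imputation.
    import pandas as pd

    energy = pd.read_csv("building_energy_data_book.csv")
    col = "annual_energy_use"  # hypothetical numeric column

    q1, q3 = energy[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (energy[col] < q1 - 1.5 * iqr) | (energy[col] > q3 + 1.5 * iqr)

    # Inspect flagged rows before deciding whether to drop, cap, or keep them.
    print(energy.loc[is_outlier, col].describe())

    # Simple, robust default for remaining missing values.
    energy[col] = energy[col].fillna(energy[col].median())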
- Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
- Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
- Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
Open Database License (ODbL) v1.0
https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By US Open Data Portal, data.gov [source]
This Electronic Health Information Legal Epidemiology dataset offers an extensive collection of legal and epidemiological data that can be used to understand the complexities of electronic health information. It contains a detailed set of variables, including legal requirements, enforcement mechanisms, proprietary tools, access restrictions, privacy and security implications, data rights and responsibilities, user accounts and authentication systems. This powerful set provides researchers with real-world insights into the functioning of EHI law in order to assess its impact on patient safety and public health outcomes. With such data it is possible to gain a better understanding of current policies regarding the regulation of electronic health information as well as their potential for improvement in safeguarding patient confidentiality. Use this dataset to explore how these laws impact our healthcare system by exploring patterns across different groups over time, or analyze changes leading up to new versions or updates. Make exciting discoveries with this comprehensive dataset!
Start by familiarizing yourself with the different columns of the dataset. Examine each column closely and look up any unfamiliar terminology to get a better understanding of what the columns are referencing.
Once you understand the data and what it is intended to represent, think about how you might want to use it in your analysis. You may want to create a research question, or narrower focus for your project surrounding legal epidemiology of electronic health information that can be answered with this data set.
After creating your research plan, begin manipulating and cleaning up the data as needed in order to prepare it for analysis or visualization, as specified in your project plan or the research question/model design steps you have outlined.
Next, perform exploratory data analysis (EDA) on relevant subsets of the data, for example specific countries or specific target groups (e.g. by gender). Filter out information that is not needed for drawing meaningful insights; analyze the patterns and trends observed in your filtered datasets; and compare areas with differing rates of e-health rules and regulations, where decisions made by elected officials may be strongly driven by demographics, socioeconomic factors, ideology and so on. Look for correlations using statistical measures as needed, validate results against reference datasets when these are available, and corroborate findings across multiple sources, while keeping the underlying datasets secure and visible only to duly authorized users.
Finally, create concrete summaries of your discoveries and share the findings, preferably as infographics showcasing the evidence and main conclusions, so that the broader community and related professionals can benefit from the results and adapt them locally.
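A minimal sketch of this EDA workflow in pandas: filter a subset, compare groups, and check correlations. The file name and column names ('year', 'state', 'requires_consent') are hypothetical placeholders, not the dataset's real schema.

    # Minimal EDA sketch; file and column names are hypothetical placeholders.
    import pandas as pd

    ehi = pd.read_csv("ehi_legal_epidemiology.csv")

    # Restrict to the years of interest and rows with a jurisdiction recorded.
    subset = ehi[(ehi["year"] >= 2015) & ehi["state"].notna()]

    # Compare how a rule varies across jurisdictions.
    print(subset.groupby("state")["requires_consent"].mean().sort_values().head())

    # Correlations among numeric indicators.
    print(subset.select_dtypes("number").corr())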
- Studying how technology affects public health policies and practice - Using the data, researchers can look at the various types of legal regulations related to electronic health information to examine any relations between technology and public health decisions in certain areas or regions.
- Evaluating trends in legal epidemiology – With this data, policymakers can identify patterns that help measure the evolution of electronic health information regulations over time and investigate why such rules are changing within different states or countries.
- Analysing possible impacts on healthcare costs – Looking at changes in laws, regulations, and standards relate...
https://creativecommons.org/publicdomain/zero/1.0/
Comprehensive football (soccer) data lake from Transfermarkt, clean and structured for analysis and machine learning.
Everything in raw CSV format – perfect for EDA, ML, and advanced football analytics.
A complete football data lake covering players, teams, transfers, performances, market values, injuries, and national team stats. Perfect for analysts, data scientists, researchers, and enthusiasts.
Here’s the high-level schema to help you understand the dataset structure:
ER diagram: https://i.imgur.com/WXLIx3L.png (Transfermarkt Dataset ER Diagram)
Organized into 10 well-structured CSV categories:
Most football datasets are pre-processed and restrictive. This one is raw, rich, and flexible:
I’m always excited to collaborate on innovative football data projects. If you’ve got an idea, let’s make it happen together!
If this dataset helps you:
- Upvote on Kaggle
- Star the GitHub repo
- Share with others in the football analytics community
Tags: football analytics, soccer dataset, transfermarkt, sports analytics, machine learning, football research, player statistics
🔥 Analyze football like never before. Your next AI or analytics project starts here.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was scraped from tables found at https://www.nfl.com/standings/league/2021/REG (for each year). This dataset contains each individual year's LEAGUE data by team. There is also a master file which has compiled all the data into one csv for analysis and comparison by year. The column description can be found at the bottom of this section. If you are interested in the code used to scrape the data, you can view the full project details at https://github.com/cvandergeugten/NFL-LEAGUE-DATA/blob/main/nfl_league_data_scraper.py
This dataset replicates the table found on the NFL's website exactly. There are some columns that can be cleaned up, renamed, or altered to allow use for analysis. There are also columns that can be used to create new features to be used in analysis. For those that want some practice on tidying up datasets and using them for predictive modeling or exploratory analysis, here is a list of objectives you can try to accomplish with this data:
Extract information from the 'record' columns (Home, Road, Division). These columns are not formatted to be directly used for analysis, so you can create new columns that indicate each statistic individually. For example, you can create a new column called "Home Wins" and then write some code to extract the number of wins from the 'Home' column. Repeat with 'Home Losses' and 'Home Ties'. If you do this for each record column, you will have transformed all that information into usable data for modeling and analysis (see the sketch after this list).
Create a feature called 'Undefeated' which will be a binary categorical variable. Input a 1 if the team never lost a game in that particular record column, and put a 0 if that team had any losses within that record. Repeat for all the different record columns (you might want to specify the record in the variable like this: 'Undefeated Home')
Create new columns for the winning and losing streak values. You can name two columns 'Win Streak #' and 'Lose Streak #' and then write some code that will extract that information from the 'Strk' column. If a team was on a winning streak, then the value for their 'Lose Streak #' should be 0.
Create new columns that indicate which division a team is in!
Have some fun and engineer some of your own features!!
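Here is a hedged sketch of the first three objectives in pandas, assuming the record columns are strings like "6-2-0" (wins-losses-ties) and 'Strk' values look like "W3" or "L2"; verify the actual formatting in the CSV before relying on this.

    # Sketch of objectives 1-3; record and streak formats are assumptions.
    import pandas as pd

    nfl = pd.read_csv("nfl_league_data_master.csv")

    # Objective 1: split a record column into usable numeric columns.
    nfl[["Home Wins", "Home Losses", "Home Ties"]] = (
        nfl["Home"].str.split("-", expand=True).astype(int)
    )

    # Objective 2: binary 'Undefeated' flag for the home record.
    nfl["Undefeated Home"] = (nfl["Home Losses"] == 0).astype(int)

    # Objective 3: separate win and lose streak counts from 'Strk'.
    streak_len = nfl["Strk"].str.extract(r"(\d+)", expand=False).astype(int)
    nfl["Win Streak #"] = streak_len.where(nfl["Strk"].str.startswith("W"), 0)
    nfl["Lose Streak #"] = streak_len.where(nfl["Strk"].str.startswith("L"), 0)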
https://cdla.io/permissive-1-0/
In the high-stakes world of professional football, public opinion often forms around emotions, loyalties, and subjective interpretations. The project at hand aims to transcend these biases by delving into a robust, data-driven analysis of Real Madrid's performance in the UEFA Champions League over the past decade.
Through a blend of traditional statistical methods, machine learning models, game theory, psychology, philosophy, and even military strategies, this investigation presents a multifaceted view of what contributes to a football team's success and how performance can be objectively evaluated.
The EDA consists of two layers:
The goal of this analysis is multifaceted:
1. Unveil Hidden Statistics: To reveal the underlying patterns often overlooked in casual discussions.
2. Demonstrate the Impact of Probability: How it shapes matches and seasons.
3. Explore Interdisciplinary Influences: Including Game Theory, Strategy, Cooperation, Psychology, Physiology, Military Training, Luck, Economics, Philosophy, and even Freudian Analysis.
4. Challenge Subjective Bias: By presenting a well-rounded, evidence-based view of football performance.
This project stands as a testament to the profound complexity of football performance and the nuanced insights that can be derived through rigorous scientific analysis. Whether a data scientist recruiter, football fanatic, or curious mind, the findings herein offer a unique perspective that bridges the gap between passion and empiricism.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes in English Wikipedia, output as JSON files (compressed in tar.gz).
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
Infoboxes: 2 GB compressed, 11 GB uncompressed.
Infoboxes + sections + short description: 4.12 GB compressed, 21.28 GB uncompressed.
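Given the file sizes, streaming the archive is usually preferable to extracting it fully. Below is a small sketch using Python's standard library; the archive name is a placeholder, and newline-delimited JSON is an assumption to verify against the actual download.

    # Streaming sketch; archive name and JSON-lines layout are assumptions.
    import json
    import tarfile

    with tarfile.open("enwiki_people_infoboxes.tar.gz", "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            with tar.extractfile(member) as fh:
                for line in fh:
                    article = json.loads(line)
                    print(article.get("name"), article.get("identifier"))
            break  # remove this to process every file in the archive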
Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # of people articles with infoboxes: 1,559,985
End stats:
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia language edition (https://en.wikipedia.org/), written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I am a really huge football fan and the Premier League is one of my favourite football (or soccer, whatever you like to call it) leagues. So, as my very first dataset, I thought this would be a great opportunity for me to make a dataset of player statistics of all seasons from the Premier League.
The Premier League, often referred to as the English Premier League or the EPL outside England, is the top level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL).
Home to some of the most famous clubs, players, managers and stadiums in world football, the Premier League is the most-watched league on the planet, with one billion homes watching the action in 188 countries. The league takes place between August and May and involves the teams playing each other home and away across the season, a total of 380 matches.
Three points are awarded for a win, one point for a draw and none for a defeat, with the team with the most points at the end of the season winning the Premier League title. The teams that finish in the bottom three of the league table at the end of the campaign are relegated to the Championship, the second tier of English football. Those teams are replaced by three clubs promoted from the Championship; the sides that finish in first and second place and the third via the end-of-season playoffs.
The data was acquired from:
https://www.premierleague.com/
I made a BeautifulSoup4 web scraper in Python 3 which automatically outputs a CSV file of all the player statistics. The runtime is about 20 minutes, but it varies with the bandwidth of the Internet connection. I made this program so that this dataset could be updated weekly. The reason for the weekly update is that the statistics change after each match a player plays, so I felt that such a program is needed for the most up-to-date results. Planning this project took 2 days, making the program in Python 3 took 7 days, and the testing and bug fixing took another 5 days. The project was completed in the span of 2 weeks.
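For readers unfamiliar with the approach, here is a minimal illustration of the requests + BeautifulSoup4 pattern described above; the URL, table markup, and output file are placeholders rather than the actual premierleague.com structure (which serves its statistics dynamically and which the author's scraper handles).

    # Generic requests + BeautifulSoup4 pattern; URL and markup are placeholders.
    import csv

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/player-stats", timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for tr in soup.select("table tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    with open("player_stats.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)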
Source credits : https://www.premierleague.com/ Image credits : https://rb.gy/wuiwth
How do variables like age, nationality and club affect player performance?