Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it was lacking county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (using the fips code column)
See the column descriptions for more details on the dataset
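As a quick-start illustration of the fips join and per-capita calculations, here is a minimal pandas sketch; the file and column names below are assumptions, so check them against the actual column descriptions:

```python
import pandas as pd

# Hypothetical file/column names -- adjust to the dataset's column descriptions.
covid = pd.read_csv("enriched_covid19_us_counties.csv", dtype={"fips": str})
other = pd.read_csv("county_unemployment.csv", dtype={"fips": str})  # any county-level table

# Join on the shared fips code column
merged = covid.merge(other, on="fips", how="left")

# Per-capita metric (per 100k residents), assuming 'cases' and 'population' columns
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000
```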
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
Example animation: https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Several columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain placeholder values such as "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
The dataset includes the following menu items with their respective price ranges:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps (a minimal pandas sketch follows the license note below):
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency: Convert Transaction Date to a proper date type and handle missing or incorrect dates.
4. Feature Engineering: Create new columns, such as Day of the Week or Transaction Month, for further analysis.

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
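A minimal cleaning sketch along the lines of the steps above (column names taken from the table; the exact imputation choices are up to you):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Treat placeholder strings as missing values
df = df.replace({"ERROR": np.nan, "UNKNOWN": np.nan})

# Coerce types; invalid entries become NaN/NaT
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Impute: median for numerics, "Unknown" for categoricals
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Simple feature engineering
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
```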
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.
This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.
This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3 GB in size.
Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.
shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).
sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109 MB in size.
solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including FastQC output and SAM, BAM, BCF, and VCF files.
vcf_clean_script.R: converts the VCF output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.
combined_tidy_vcf.csv: output of vcf_clean_script.R
License: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
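eLAB itself is written in R, but the key-value remapping idea is easy to illustrate. The following Python sketch is only a schematic analog; the subtype-to-code mapping and unit table below are made-up excerpts, not the actual eLAB lookup table or data dictionary:

```python
import pandas as pd

# Hypothetical excerpt of a lab-subtype lookup table; the real eLAB table has ~300 entries.
lab_lookup = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium,venous": "potassium",
}
allowed_units = {"potassium": "mmol/L"}  # units pre-defined by the registry data dictionary

labs = pd.DataFrame({
    "lab_name": ["Potassium(POC)", "Potassium-External", "Sodium"],
    "value": [4.1, 3.9, 140],
    "unit": ["mmol/L", "mmol/L", "mmol/L"],
})

# Remap subtypes to a data-dictionary code, then keep only labs/units defined by the DD
labs["dd_code"] = labs["lab_name"].map(lab_lookup)
labs = labs.dropna(subset=["dd_code"])
labs = labs[labs["unit"] == labs["dd_code"].map(allowed_units)]
print(labs)
```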
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
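The analysis described here was done in R (survival/survminer, per the package list above). Purely as an illustration for readers working in Python, an analogous univariable Cox fit with the lifelines package might look like this; the column names and values are placeholders:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder columns: follow-up time in months, death indicator (1 = event, 0 = censored),
# and one baseline lab predictor.
df = pd.DataFrame({
    "os_months": [12.0, 30.5, 8.2, 44.1, 21.0, 15.3],
    "death": [1, 0, 1, 0, 1, 0],
    "baseline_potassium": [4.1, 3.8, 5.0, 4.4, 4.7, 4.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="os_months", event_col="death")
cph.print_summary()  # hazard ratio, CI, and (exploratory) p-value for the lab predictor
```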
License: https://creativecommons.org/publicdomain/zero/1.0/
Dataset Overview This dataset simulates the academic and extracurricular records of students in a Nigerian primary school. It contains three tables designed to capture key aspects of the student lifecycle, including demographic information, academic scores, and their affiliations with sport houses. The dataset can be used for educational purposes, research, and exploratory data analysis.
Context and Inspiration This dataset is inspired by the structure of Nigerian primary schools, where students are grouped into sport houses for extracurricular activities and assessed on academic performance. It is a useful resource for: Exploring relationships between demographics, academic performance, and extracurricular activities. Analyzing patterns in hobbies and character traits. Creating visualizations for school or student performance analytics.
Usage This dataset is synthetic but can be used for: Data science practice, including cleaning, wrangling, and visualization. Developing machine learning models to predict academic outcomes or classify students. Creating dashboards and reports for educational analytics.
License This dataset is synthetic and open for public use. Feel free to use it for learning, research, and creative projects.
Acknowledgments The dataset was generated using Python libraries, including: Faker for generating realistic student data. Pandas for organizing and exporting the dataset.
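A rough sketch of how records like these could be generated with Faker and pandas; the field names, score ranges, and house names below are guesses for illustration, not the dataset's actual schema:

```python
import random
import pandas as pd
from faker import Faker

fake = Faker()
sport_houses = ["Red House", "Blue House", "Green House", "Yellow House"]  # placeholder names

students = pd.DataFrame({
    "student_id": range(1, 101),
    "name": [fake.name() for _ in range(100)],
    "age": [random.randint(5, 12) for _ in range(100)],
    "gender": [random.choice(["Male", "Female"]) for _ in range(100)],
    "sport_house": [random.choice(sport_houses) for _ in range(100)],
    "maths_score": [random.randint(40, 100) for _ in range(100)],
})
students.to_csv("students.csv", index=False)
```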
Example Questions to Explore Which sport house has the best average performance in academics? Is there a correlation between hobbies and academic scores? Are there performance differences between male and female students? What is the distribution of student ages across sport houses?
Additional file 3: Example of report formatting request to send to the lab.
This repository contains the results of a research project which provides a benchmark dataset for extracting greenhouse gas emissions from corporate annual and sustainability reports. The paper which explains the data collection methodology and provides a detailed description of the benchmark dataset can be found in the Nature Scientific Data journal publication.
The zipped datasets file contains two datasets, gold_standard and annotation_dataset (inside the outer zip file there is a password-protected zip file containing the two datasets; to unpack, use the password provided in the outer zip file).
Emission values were first extracted by large language models (columns prefixed with llm_ in annotation_dataset). The extracted emissions follow the categories Scope 1, Scope 2 (market-based), Scope 2 (location-based) and Scope 3, as defined in the GHG Protocol (see the scope variables). Values were then annotated by non-experts (columns prefixed with non_expert_ in annotation_dataset), then by expert groups (columns prefixed with exp_group_ in annotation_dataset) in case of disagreement among the non-experts, and finally in a discussion of all experts (columns prefixed with exp_disc in annotation_dataset) in case of disagreement between the expert groups. The annotation guidelines for the non-experts and experts are also included in this repository. The final agreed values form gold_standard. Codebooks detailing each variable of each of the two datasets are also provided. More details about the annotation template or the data wrangling scripts can be found in the GitHub repository. Users can match the two datasets (gold_standard and annotation_dataset) using the variable combination of company_name, report_year and merge_id (index column). The merge_id already includes the company name and report year implicitly, but to avoid column duplication in the join operation, they should be included as join variables. For example, this is useful when comparing LLM extractions to the gold standard data.
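Matching the two datasets might look like this in pandas; the file names are assumptions, while the join keys are the ones documented above:

```python
import pandas as pd

gold = pd.read_csv("gold_standard.csv")          # placeholder file name
annot = pd.read_csv("annotation_dataset.csv")    # placeholder file name

# Join on the documented key combination to avoid duplicated company/year columns
keys = ["company_name", "report_year", "merge_id"]
merged = annot.merge(gold, on=keys, how="inner", suffixes=("_annot", "_gold"))

# A comparison of LLM extractions against the gold standard would then be, e.g.:
# mismatch = merged["llm_scope_1"] != merged["gold_scope_1"]   # placeholder column names
```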
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This notebook aims to predict the HVAC system's power consumption (active_power) at a given time using the previous 15 minutes of sensor and operational data. For example, to predict the power at 10:00, the model uses data from 9:45 to 10:00. The notebook provides data cleaning, feature engineering, and modeling steps for this predictive task. Additionally, it may require further feature engineering and data wrangling to enhance model performance and data usability.
This dataset contains 3 months of historical data from an HVAC system, with records every 5 minutes. The data includes operational parameters and environmental sensor readings, both inside and outside the cooled space.
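Since records arrive every 5 minutes, the "previous 15 minutes" amounts to three lagged observations per column. A sketch of building such lag features with pandas (the column names below are assumptions):

```python
import pandas as pd

df = (pd.read_csv("hvac_data.csv", parse_dates=["timestamp"])  # placeholder file/column names
        .set_index("timestamp")
        .sort_index())

# Previous 15 minutes at a 5-minute sampling rate = 3 lags of each sensor/operational column
feature_cols = ["active_power", "supply_air_temp", "outside_temp"]  # placeholder names
for col in feature_cols:
    for lag in (1, 2, 3):  # 5, 10, 15 minutes back
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

# Target is active_power at the current timestamp; drop rows without a full 15-minute history
model_df = df.dropna(subset=[f"{c}_lag{l}" for c in feature_cols for l in (1, 2, 3)])
```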
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The interpretation of biological data sets is essential for generating hypotheses that guide research, yet modern methods of global analysis challenge our ability to discern meaningful patterns and then convey results in a way that can be easily appreciated. Proteomic data is especially challenging because mass spectrometry detectors often miss peptides in complex samples, resulting in sparsely populated data sets. Using the R programming language and techniques from the field of pattern recognition, we have devised methods to resolve and evaluate clusters of proteins related by their pattern of expression in different samples in proteomic data sets. We examined tyrosine phosphoproteomic data from lung cancer samples. We calculated dissimilarities between the proteins based on Pearson or Spearman correlations and on Euclidean distances, whilst dealing with large amounts of missing data. The dissimilarities were then used as feature vectors in clustering and visualization algorithms. The quality of the clusterings and visualizations were evaluated internally based on the primary data and externally based on gene ontology and protein interaction networks. The results show that t-distributed stochastic neighbor embedding (t-SNE) followed by minimum spanning tree methods groups sparse proteomic data into meaningful clusters more effectively than other methods such as k-means and classical multidimensional scaling. Furthermore, our results show that using a combination of Spearman correlation and Euclidean distance as a dissimilarity representation increases the resolution of clusters. Our analyses show that many clusters contain one or more tyrosine kinases and include known effectors as well as proteins with no known interactions. Visualizing these clusters as networks elucidated previously unknown tyrosine kinase signal transduction pathways that drive cancer. Our approach can be applied to other data types, and can be easily adopted because open source software packages are employed.
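The authors' analysis was done in R. Purely as a schematic illustration of the dissimilarity-then-embedding idea (not the paper's code), a Python analog could look like the following; the input file name is a placeholder and the matrix is assumed to be proteins by samples:

```python
import pandas as pd
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# proteins x samples intensity matrix with many missing values (NaN)
X = pd.read_csv("phosphoproteome.csv", index_col=0)  # placeholder file name

# Pairwise Spearman correlation between proteins, using only samples observed in both rows
corr = X.T.corr(method="spearman")
diss = (1 - corr).fillna(1.0)  # treat pairs with no overlap as maximally dissimilar

# 2-D t-SNE embedding from the precomputed dissimilarity matrix
emb = TSNE(metric="precomputed", init="random", random_state=0).fit_transform(diss.values)

# Minimum spanning tree over the embedded points to expose cluster structure
mst = minimum_spanning_tree(squareform(pdist(emb)))
```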
This intermediate-level data set was extracted from the census bureau database. There are 48,842 instances in the data set, a mix of continuous and discrete attributes (train = 32,561, test = 16,281).
The data set has 15 attributes, which include age, sex, education level and other relevant details of a person. The data set will help to improve your skills in Exploratory Data Analysis, Data Wrangling, Data Visualization and Classification Models.
Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The following description gives more details on this data set:
- age: The age of an individual.
- workclass: The type of work or employment of an individual. It can have the following categories: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked.
- fnlwgt (Final Weight): The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by the Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used.
People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
- education: The highest level of education completed.
- education-num: The number of years of education completed.
- marital-status: The marital status of an individual.
- occupation: The type of work performed by an individual.
- relationship: The relationship status.
- race: The race of an individual.
- sex: The gender of an individual.
- capital-gain: The amount of capital gain (financial profit).
- capital-loss: The amount of capital loss an individual has incurred.
- hours-per-week: The number of hours worked per week.
- native-country: The country of origin or the native country.
- income: The income level of an individual; serves as the target variable. It indicates whether income is greater than $50,000 or less than or equal to $50,000, denoted as (>50K, <=50K).
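A quick-start sketch for a first classification model; it assumes a CSV with a header row and "?" as the missing-value marker, so adjust to the files as distributed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes a header row named like the attributes above; if the file ships without
# headers, pass names=[...] to read_csv instead.
df = pd.read_csv("adult.csv", na_values="?", skipinitialspace=True)

X = pd.get_dummies(df.drop(columns="income"))          # one-hot encode categoricals
y = (df["income"].str.strip() == ">50K").astype(int)   # binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```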
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2. Example of transformed metadata: In this .xlsx (MS Excel) file, we list all the output metadata categories generated for each sample from the transformation of the 1KGP input datasets. The output metadata include information collected from all the four 1KGP metadata files considered. Some categories are not reported in the source metadata files—they are identified by the label manually_curated_...—and were added by the developed pipeline to store technical details (e.g., download date, the md5 hash of the source file, file size, etc.) and information derived from the knowledge of the source, such as the species, the processing pipeline used in the source and the health status. For every information category, the table reports a possible value. The third column (cardinality > 1) tells whether the same key can appear multiple times in the output GDM metadata file. This is used to represent multi-valued metadata categories; for example, in a GDM metadata file, the key manually_curated_chromosome appears once for every chromosome mutated by the variants of the sample.
As a final project for Data Wrangling this fall (2024), we were tasked with using our new skills in collecting and importing data via web scraping, online API queries, and file import to create a relational data set of 3 tables, with 2 related. We also had to use our tidying skills to clean and transform the imported data to prepare it for visualization and analysis, focusing on column types and names, categorical variables, etc.
An example notebook analyzing this information is provided, with 5 examples of analysis using mutating joins, tidying, and/or ggplot.
Design a web portal to automate the various operations performed in machine learning projects to solve specific problems related to supervised or unsupervised use cases. The web portal must have the capabilities to perform the below-mentioned tasks:
1. Extract Transform Load:
   a. Extract: The portal should provide the capability to configure any data source, for example cloud storage (AWS, Azure, GCP), databases (RDBMS, NoSQL), and real-time streaming data, to extract data into the portal. (Allow feasibility to write a custom script if required to connect to any data source to extract data.)
   b. Transform: The portal should provide various inbuilt functions/components to apply a rich set of transformations to transform extracted data into the desired format.
   c. Load: The portal should be able to save data into any of the cloud storages after the extracted data is transformed into the desired format.
   d. Allow the user to write a custom script in Python if some functionality is not present in the portal.
2. Exploratory Data Analysis: The portal should allow users to perform exploratory data analysis.
3. Data Preparation: Data wrangling, feature extraction and feature selection should be automated with minimal user intervention.
4. The application must suggest the machine learning algorithm best suited for the use case and perform a best-model search operation to automate model development.
5. The application should provide a feature to deploy the model in any of the clouds, and the application should create a prediction API to predict new instances.
6. The application should log each and every detail so that each activity can be audited in the future to investigate any event.
7. A detailed report should be generated for ETL, EDA, data preparation, and model development and deployment.
8. Create a dashboard to monitor model performance and create various alert mechanisms to notify the appropriate user to take necessary precautions.
9. Create functionality to implement retraining for an existing model if necessary.
10. The portal must be designed in such a way that it can be used by multiple organizations/users, where each organization/user is isolated from the others.
11. The portal should provide functionality to manage users, similar to the RBAC concept used in the cloud. (It is not necessary to build many roles, but design it in such a way that roles can be added in the future so that newly created roles can also be applied to users.) An organization can have multiple users and each user will have a specific role.
12. The portal should have a scheduler to schedule training or prediction tasks, and appropriate alerts regarding scheduled jobs should be sent to the subscriber/configured email id.
13. Implement watcher functionality to perform prediction as soon as a file arrives at the input location.
You have to build a solution that should summarize the various news articles from different reading categories.
Code: You are supposed to write code in a modular fashion.
- Safe: It can be used without causing harm.
- Testable: It can be tested at the code level.
- Maintainable: It can be maintained, even as your codebase grows.
- Portable: It works the same in every environment (operating system).
You have to maintain your code on GitHub and keep your GitHub repo public so that anyone can check your code. Maintain a proper readme file for the project; you should include the basic workflow and execution of the entire project in the readme file on GitHub. Follow the coding standards: https://www.python.org/dev/peps/pep-0008/
You can use any database (RDBMS or NoSQL) or use multiple databases.
You can use any cloud platform for this entire solution hosting like AWS, Azure or GCP.
Logging is a must for every action performed by your code; use the Python logging library for this.
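A minimal sketch of the kind of logging this implies (the logger names, log file path, and messages are placeholders, not part of the assignment):

```python
import logging

# Configure once at application start-up; every module then gets its own named logger.
logging.basicConfig(
    filename="portal.log",                      # placeholder log destination
    level=logging.INFO,
    format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
)

logger = logging.getLogger("etl.extract")       # hypothetical component name
logger.info("Started extraction from source %s", "s3://bucket/raw/")  # placeholder source
logger.error("Connection failed, retrying")     # errors are captured for later audit
```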
Use a source version control tool to implement a CI/CD pipeline, e.g. Azure DevOps, GitHub, Circle CI.
You can host your application on a cloud platform using an automated CI/CD pipeline.
You have to submit complete solution design strate...
License: https://www.usa.gov/government-works/
Originally, I was planning to use the Python Quandl API to get the data from here because it is already conveniently in time-series format. However, the data is split by reporting agency, which makes it difficult to get an accurate picture of the true short ratio because of missing data and difficulty in aggregation. So, I clicked on the source link, which turned out to be a gold mine because of their consolidated data. The only downside was that it was all in .txt format, so I had to use regex to parse it and data scraping to get the information from the website, but that was a good refresher 😄.
For better understanding of what the values in the text file mean, you can read this pdf from FINRA: https://www.finra.org/sites/default/files/2020-12/short-sale-volume-user-guide.pdf
I condensed all the individual text files into a single .txt file such that it's much faster and less complex to write code compared to having to iterate through each individual .txt file. I created several functions for this dataset so please check out my workbook "FINRA Short Ratio functions" where I have described step by step on how I gathered the data and formatted it so that you can understand and modify them to fit your needs. Note that the data is only for the range of 1st April 2020 onwards (20200401 to 20210312 as of gathering the data) and the contents are separated by | delimiters so I used \D (non-digit) in regex to avoid confusion with the (a|b) pattern syntax.
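If you just want to get started without the regex route, the pipe-delimited file can also be read directly with pandas; the file name below is a placeholder and the short-ratio columns are assumed to follow FINRA's user guide linked above, so check your file's header:

```python
import pandas as pd

# The consolidated file uses '|' as the field delimiter
short_vol = pd.read_csv("finra_short_volume_combined.txt", sep="|")

# e.g., a daily short ratio per symbol, assuming ShortVolume/TotalVolume columns exist
short_vol["short_ratio"] = short_vol["ShortVolume"] / short_vol["TotalVolume"]
print(short_vol.head())
```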
If you need historical data before April 2020, you can use the quandl database but it has non-consolidated information and you have to make a reference call for each individual stock for each agency so you would need to manually input tickers or get a list of all tickers through regex of the txt files or something like that 😅.
An excellent task to combine regular expressions (regex), web scraping, plotting, and data wrangling... see my notebook for an example with annotated workflow. Please comment and feel free to fork and modify my workbook to change the functionality. Possibly the short volumes can be combined with p/b ratios or price data to see the correlation --> can use seaborn pairgrid to visualise this for multiple stocks?
License: https://creativecommons.org/publicdomain/zero/1.0/
These datasets are scraped from basketball-reference.com and include all NBA games from the 1996-97 season to the 2020-21 season. My goal in this project is to create datasets that can be used by beginners to practice basic data science skills, such as data wrangling and cleaning, in a setting where it is easy to go back to the raw data to understand surprising results. For example, outliers can be difficult to understand when working with a taxi dataset, whereas the NBA has a large community of reporters, experts and game videos that may help you understand what is going on with the data.
The web scrapers used to collect the data can be found at: https://github.com/PatrickH1994/nba_webscrapes
The dataset will include all the information available at basketball-reference.com once the project is done.
Current files: - Games - Play-by-play - Player stats - Salary data
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information from the NFL Combine (2009 to 2019), including the results from sports performance tests and draft outcomes.
As sports statistics are in the public domain, this database was freely downloaded from https://www.pro-football-reference.com/
I appreciate the efforts of https://www.pro-football-reference.com/ in collating and hosting sports related data, and Kaggle for providing a platform for sharing datasets and knowledge.
This dataset is useful for beginners and intermediate users, who can practice visualisations, analytics, imputation, data cleaning/wrangling, and classification modelling. For example: What are the variables of importance in predicting round pick or draft status? Which school has the highest number of players being drafted into the NFL? What position type or player type is most represented at the NFL Combine? Do drafted and undrafted players perform differently on performance tests?
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🧪 Covid-19 Clinical Trials Dataset (Raw + Cleaned)
This dataset offers a deep look into the global clinical research landscape during the Covid-19 pandemic. Sourced directly from ClinicalTrials.gov, it provides structured and semi-structured information on registered Covid-19-related clinical trials across countries, sponsors, and phases.
📁 What’s Included • COVID_clinical_trials.csv — Raw dataset as obtained from ClinicalTrials.gov • Covid-19_cleaned_dataset.csv — Preprocessed version for direct use in data analysis and visualization tasks
🎯 Use Case & Learning Goals
This dataset is ideal for: • Practicing data cleaning, preprocessing, and wrangling • Performing exploratory data analysis (EDA) • Building interactive dashboards (e.g., with Tableau or Plotly) • Training ML models for classification or forecasting (e.g., predicting trial outcomes) • Exploring trends in clinical trial research during global health emergencies
🔍 Key Features
Each row represents a registered clinical trial and includes fields such as: • NCT Number (unique ID) • Study Title • Start Date and Completion Date • Phase • Study Type (Interventional/Observational) • Enrollment Size • Country, Sponsor, and Intervention Type • Study Status (Recruiting, Completed, Withdrawn, etc.)
✅ Cleaned Dataset
The cleaned version includes: • Standardized column naming • Filled missing values where possible • Removed duplicates and a few columns
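A sketch of what that preprocessing could look like in pandas (the raw file name comes from the list above; the specific columns dropped and fill rules are illustrative assumptions):

```python
import pandas as pd

raw = pd.read_csv("COVID_clinical_trials.csv")

# Standardize column names: strip, lowercase, underscores instead of spaces
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop exact duplicates and columns not needed for analysis (illustrative choice)
cleaned = raw.drop_duplicates()
cleaned = cleaned.drop(columns=["url"], errors="ignore")

# Fill missing values where a sensible default exists (column name assumed)
if "phases" in cleaned.columns:
    cleaned["phases"] = cleaned["phases"].fillna("Not Applicable")

cleaned.to_csv("Covid-19_cleaned_dataset.csv", index=False)
```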
📊 Example Applications • Country-wise contribution analysis • Sponsor landscape visualization • Trial timeline and phase progression charts • Predictive modeling of trial duration or status
🙏 Acknowledgments
Thanks to ClinicalTrials.gov for providing public access to this critical data.
https://www.reddit.com/wiki/api
Overview
The dataset was downloaded as a CSV file containing 1M posts from the r/Jokes subreddit. Of the relevant features, the "title" is the post's title, i.e. the joke's setup. The "selftext" is the punchline, or what you see once a user clicks on the post's content. It's worth noting that many jokes in this data table don't meet this criterion (NaNs).
Score
The "score" value describes the number of upvotes, i.e. the number of positive ratings the post received. Posts can additionally be downvoted, and while Reddit allows for negative values, the minimum value in the dataset is zero. When a user posts something to Reddit, however, they are automatically given a single upvote, so I am making the assumption that values of zero in this dataset were downvoted.
Exploratory Data Analysis - Try to understand intuitively "what makes a joke funny" using simple exploratory data analysis.
Funny / Not Funny - Classification - The ultimate goal in wrangling these data is to create a dataset for classifying jokes as either funny or not funny using the upvotes (a labeling sketch follows this list).
Jokes Generation - Train and generate jokes using a language generation model (GPT for example).
Funny Jokes Generation - Training and generating jokes using language models is one thing but generating Funny jokes using language models is a completely different task! (which is much much harder to do)
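One possible way to derive a binary label from the score; the file name is a placeholder and the threshold is arbitrary, while the column names come from the overview above:

```python
import pandas as pd

jokes = pd.read_csv("reddit_jokes.csv")   # placeholder file name

# Drop posts without a punchline (NaN selftext), per the note in the overview
jokes = jokes.dropna(subset=["selftext"])

# Arbitrary threshold: call a joke "funny" if it collected more than 10 upvotes
jokes["funny"] = (jokes["score"] > 10).astype(int)
print(jokes["funny"].value_counts())
```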
Are you up for a challenge? ;)
oslo-city-bike License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because there is also similar data for Trondheim and Bergen.
Variables (from oslobysykkel.no):

| Variable | Format | Description |
|---|---|---|
| started_at | Timestamp | Timestamp of when the trip started |
| ended_at | Timestamp | Timestamp of when the trip ended |
| duration | Integer | Duration of trip in seconds |
| start_station_id | String | Unique ID for start station |
| start_station_name | String | Name of start station |
| start_station_description | String | Description of where start station is located |
| start_station_latitude | Decimal degrees in WGS84 | Latitude of start station |
| start_station_longitude | Decimal degrees in WGS84 | Longitude of start station |
| end_station_id | String | Unique ID for end station |
| end_station_name | String | Name of end station |
| end_station_description | String | Description of where end station is located |
| end_station_latitude | Decimal degrees in WGS84 | Latitude of end station |
| end_station_longitude | Decimal degrees in WGS84 | Longitude of end station |
Please note: this data and my analysis focuses on the new data format, but historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern.
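Loading and stacking the monthly files might look like this (the glob pattern follows the naming scheme above; folder layout is an assumption):

```python
import glob
import pandas as pd

# All monthly files in the oslobysykkel folder, e.g. oslobysykkel-2023-07.csv
files = sorted(glob.glob("oslobysykkel/oslobysykkel-*.csv"))
trips = pd.concat(
    (pd.read_csv(f, parse_dates=["started_at", "ended_at"]) for f in files),
    ignore_index=True,
)

# Simple sanity checks: trips per month and median duration in minutes
print(trips.set_index("started_at").resample("MS").size())
print(trips["duration"].median() / 60)
```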
I myself was extremely fascinated by this open data from Oslo City Bike, and in the process of deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytical problem book, or even a platform, called 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), and so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis** https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project — stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks:
- Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
- Exploring Open Data from Oslo City Bike, by Jon Olave — visualization of popular routes and seasonality analysis.
- Oslo City Bike Data Wrangling, by Karl Tryggvason — predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
- Helsinki City Bikes: Exploratory Data Analysis — analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did it for weather data: integrating temperature, precipitation, and wind speed to explain variations in daily demand. https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that is where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
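As an illustration of that kind of enrichment, daily trip counts could be joined with a daily weather table; the weather CSV and its column names are placeholders (meteostat-style exports), not files shipped with this dataset:

```python
import pandas as pd

trips = pd.read_csv("oslobysykkel/oslobysykkel-2024-06.csv", parse_dates=["started_at"])
weather = pd.read_csv("oslo_weather_daily.csv")  # placeholder export with a 'date' column (YYYY-MM-DD)

# Daily demand: number of trips started per calendar day
daily = (
    trips.assign(date=trips["started_at"].dt.strftime("%Y-%m-%d"))
         .groupby("date").size().rename("trips").reset_index()
)

joined = weather.merge(daily, on="date", how="inner")
print(joined[["trips", "tavg", "prcp", "wspd"]].corr())  # weather column names assumed
```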
Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series
License: https://creativecommons.org/publicdomain/zero/1.0/
In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon in which disappearing worker bees cause the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. Twelve years later, some industries are observing recovery, but the American honey industry is still largely struggling. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America by state from 1998 to 2012.
The National Agricultural Statistics Service (NASS) is the primary data reporting body for the US Department of Agriculture (USDA). NASS's mission is to "provide timely, accurate, and useful statistics in service to U.S. agriculture". From datasets to census surveys, their data covers virtually all aspects of U.S. agriculture. Honey production is one of the datasets offered; the original page containing the data, along with related datasets such as Honey Bee Colonies and Cost of Pollination, is available on the NASS site. Data wrangling was performed in order to clean the dataset. honeyproduction.csv is the final tidy dataset suitable for analysis. The three other datasets (which include "honeyraw" in the title) are the original raw data downloaded from the site. They are uploaded to this page along with the "**Wrangling The Honey Production Dataset**" kernel as an example to show users how data can be wrangled into a cleaner format. Useful metadata on certain variables of the honeyproduction dataset is provided below:
Honey production data was published by the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture. The beautiful banner photo was by Eric Ward on Unsplash.