Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it was lacking county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (using the fips code column)
See the column descriptions for more details on the dataset
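As a quick-start illustration of the fips join and per-capita calculations, here is a minimal pandas sketch; the file and column names below are assumptions, so check them against the actual column descriptions:

```python
import pandas as pd

# Hypothetical file/column names -- adjust to the dataset's column descriptions.
covid = pd.read_csv("enriched_covid19_us_counties.csv", dtype={"fips": str})
other = pd.read_csv("county_unemployment.csv", dtype={"fips": str})  # any county-level table

# Join on the shared fips code column
merged = covid.merge(other, on="fips", how="left")

# Per-capita metric (per 100k residents), assuming 'cases' and 'population' columns
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000
```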
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
Example animation: https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Several columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain placeholder values such as "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
The dataset includes the following menu items with their respective price ranges:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps (a minimal pandas sketch follows the license note below):
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency: Convert Transaction Date to a proper date type and handle missing or incorrect dates.
4. Feature Engineering: Create new columns, such as Day of the Week or Transaction Month, for further analysis.

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
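A minimal cleaning sketch along the lines of the steps above (column names taken from the table; the exact imputation choices are up to you):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Treat placeholder strings as missing values
df = df.replace({"ERROR": np.nan, "UNKNOWN": np.nan})

# Coerce types; invalid entries become NaN/NaT
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Impute: median for numerics, "Unknown" for categoricals
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Simple feature engineering
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
```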
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.
This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.
This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3 GB in size.
Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.
shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).
sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109 MB in size.
solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including FastQC output and SAM, BAM, BCF, and VCF files.
vcf_clean_script.R: converts the VCF output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.
combined_tidy_vcf.csv: output of vcf_clean_script.R
License: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
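eLAB itself is written in R, but the key-value remapping idea is easy to illustrate. The following Python sketch is only a schematic analog; the subtype-to-code mapping and unit table below are made-up excerpts, not the actual eLAB lookup table or data dictionary:

```python
import pandas as pd

# Hypothetical excerpt of a lab-subtype lookup table; the real eLAB table has ~300 entries.
lab_lookup = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium,venous": "potassium",
}
allowed_units = {"potassium": "mmol/L"}  # units pre-defined by the registry data dictionary

labs = pd.DataFrame({
    "lab_name": ["Potassium(POC)", "Potassium-External", "Sodium"],
    "value": [4.1, 3.9, 140],
    "unit": ["mmol/L", "mmol/L", "mmol/L"],
})

# Remap subtypes to a data-dictionary code, then keep only labs/units defined by the DD
labs["dd_code"] = labs["lab_name"].map(lab_lookup)
labs = labs.dropna(subset=["dd_code"])
labs = labs[labs["unit"] == labs["dd_code"].map(allowed_units)]
print(labs)
```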
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
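The analysis described here was done in R (survival/survminer, per the package list above). Purely as an illustration for readers working in Python, an analogous univariable Cox fit with the lifelines package might look like this; the column names and values are placeholders:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder columns: follow-up time in months, death indicator (1 = event, 0 = censored),
# and one baseline lab predictor.
df = pd.DataFrame({
    "os_months": [12.0, 30.5, 8.2, 44.1, 21.0, 15.3],
    "death": [1, 0, 1, 0, 1, 0],
    "baseline_potassium": [4.1, 3.8, 5.0, 4.4, 4.7, 4.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="os_months", event_col="death")
cph.print_summary()  # hazard ratio, CI, and (exploratory) p-value for the lab predictor
```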
License: https://creativecommons.org/publicdomain/zero/1.0/
Dataset Overview This dataset simulates the academic and extracurricular records of students in a Nigerian primary school. It contains three tables designed to capture key aspects of the student lifecycle, including demographic information, academic scores, and their affiliations with sport houses. The dataset can be used for educational purposes, research, and exploratory data analysis.
Context and Inspiration This dataset is inspired by the structure of Nigerian primary schools, where students are grouped into sport houses for extracurricular activities and assessed on academic performance. It is a useful resource for: Exploring relationships between demographics, academic performance, and extracurricular activities. Analyzing patterns in hobbies and character traits. Creating visualizations for school or student performance analytics.
Usage This dataset is synthetic but can be used for: Data science practice, including cleaning, wrangling, and visualization. Developing machine learning models to predict academic outcomes or classify students. Creating dashboards and reports for educational analytics.
License This dataset is synthetic and open for public use. Feel free to use it for learning, research, and creative projects.
Acknowledgments The dataset was generated using Python libraries, including: Faker for generating realistic student data. Pandas for organizing and exporting the dataset.
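A rough sketch of how records like these could be generated with Faker and pandas; the field names, score ranges, and house names below are guesses for illustration, not the dataset's actual schema:

```python
import random
import pandas as pd
from faker import Faker

fake = Faker()
sport_houses = ["Red House", "Blue House", "Green House", "Yellow House"]  # placeholder names

students = pd.DataFrame({
    "student_id": range(1, 101),
    "name": [fake.name() for _ in range(100)],
    "age": [random.randint(5, 12) for _ in range(100)],
    "gender": [random.choice(["Male", "Female"]) for _ in range(100)],
    "sport_house": [random.choice(sport_houses) for _ in range(100)],
    "maths_score": [random.randint(40, 100) for _ in range(100)],
})
students.to_csv("students.csv", index=False)
```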
Example Questions to Explore Which sport house has the best average performance in academics? Is there a correlation between hobbies and academic scores? Are there performance differences between male and female students? What is the distribution of student ages across sport houses?
Additional file 3: Example of report formatting request to send to the lab.
This repository contains the results of a research project which provides a benchmark dataset for extracting greenhouse gas emissions from corporate annual and sustainability reports. The paper which explains the data collection methodology and provides a detailed description of the benchmark dataset can be found in the Nature Scientific Data journal publication.
The zipped datasets file contains two datasets, gold_standard and annotation_dataset (inside the outer zip file there is a password-protected zip file containing the two datasets; to unpack, use the password provided in the outer zip file).
Emission values were first extracted by large language models (columns prefixed with llm_ in annotation_dataset). The extracted emissions follow the categories Scope 1, Scope 2 (market-based), Scope 2 (location-based) and Scope 3, as defined in the GHG Protocol (see the scope variables). Values were then annotated by non-experts (columns prefixed with non_expert_ in annotation_dataset), then by expert groups (columns prefixed with exp_group_ in annotation_dataset) in case of disagreement among the non-experts, and finally in a discussion of all experts (columns prefixed with exp_disc in annotation_dataset) in case of disagreement between the expert groups. The annotation guidelines for the non-experts and experts are also included in this repository. The final agreed values form gold_standard. Codebooks detailing each variable of each of the two datasets are also provided. More details about the annotation template or the data wrangling scripts can be found in the GitHub repository. Users can match the two datasets (gold_standard and annotation_dataset) using the variable combination of company_name, report_year and merge_id (index column). The merge_id already includes the company name and report year implicitly, but to avoid column duplication in the join operation, they should be included as join variables. For example, this is useful when comparing LLM extractions to the gold standard data.
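Matching the two datasets might look like this in pandas; the file names are assumptions, while the join keys are the ones documented above:

```python
import pandas as pd

gold = pd.read_csv("gold_standard.csv")          # placeholder file name
annot = pd.read_csv("annotation_dataset.csv")    # placeholder file name

# Join on the documented key combination to avoid duplicated company/year columns
keys = ["company_name", "report_year", "merge_id"]
merged = annot.merge(gold, on=keys, how="inner", suffixes=("_annot", "_gold"))

# A comparison of LLM extractions against the gold standard would then be, e.g.:
# mismatch = merged["llm_scope_1"] != merged["gold_scope_1"]   # placeholder column names
```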
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This notebook aims to predict the HVAC system's power consumption (active_power) at a given time using the previous 15 minutes of sensor and operational data. For example, to predict the power at 10:00, the model uses data from 9:45 to 10:00. The notebook provides data cleaning, feature engineering, and modeling steps for this predictive task. Additionally, it may require further feature engineering and data wrangling to enhance model performance and data usability.
This dataset contains 3 months of historical data from an HVAC system, with records every 5 minutes. The data includes operational parameters and environmental sensor readings, both inside and outside the cooled space.
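Since records arrive every 5 minutes, the "previous 15 minutes" amounts to three lagged observations per column. A sketch of building such lag features with pandas (the column names below are assumptions):

```python
import pandas as pd

df = (pd.read_csv("hvac_data.csv", parse_dates=["timestamp"])  # placeholder file/column names
        .set_index("timestamp")
        .sort_index())

# Previous 15 minutes at a 5-minute sampling rate = 3 lags of each sensor/operational column
feature_cols = ["active_power", "supply_air_temp", "outside_temp"]  # placeholder names
for col in feature_cols:
    for lag in (1, 2, 3):  # 5, 10, 15 minutes back
        df[f"{col}_lag{lag}"] = df[col].shift(lag)

# Target is active_power at the current timestamp; drop rows without a full 15-minute history
model_df = df.dropna(subset=[f"{c}_lag{l}" for c in feature_cols for l in (1, 2, 3)])
```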
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The interpretation of biological data sets is essential for generating hypotheses that guide research, yet modern methods of global analysis challenge our ability to discern meaningful patterns and then convey results in a way that can be easily appreciated. Proteomic data is especially challenging because mass spectrometry detectors often miss peptides in complex samples, resulting in sparsely populated data sets. Using the R programming language and techniques from the field of pattern recognition, we have devised methods to resolve and evaluate clusters of proteins related by their pattern of expression in different samples in proteomic data sets. We examined tyrosine phosphoproteomic data from lung cancer samples. We calculated dissimilarities between the proteins based on Pearson or Spearman correlations and on Euclidean distances, whilst dealing with large amounts of missing data. The dissimilarities were then used as feature vectors in clustering and visualization algorithms. The quality of the clusterings and visualizations were evaluated internally based on the primary data and externally based on gene ontology and protein interaction networks. The results show that t-distributed stochastic neighbor embedding (t-SNE) followed by minimum spanning tree methods groups sparse proteomic data into meaningful clusters more effectively than other methods such as k-means and classical multidimensional scaling. Furthermore, our results show that using a combination of Spearman correlation and Euclidean distance as a dissimilarity representation increases the resolution of clusters. Our analyses show that many clusters contain one or more tyrosine kinases and include known effectors as well as proteins with no known interactions. Visualizing these clusters as networks elucidated previously unknown tyrosine kinase signal transduction pathways that drive cancer. Our approach can be applied to other data types, and can be easily adopted because open source software packages are employed.
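The authors' analysis was done in R. Purely as a schematic illustration of the dissimilarity-then-embedding idea (not the paper's code), a Python analog could look like the following; the input file name is a placeholder and the matrix is assumed to be proteins by samples:

```python
import pandas as pd
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# proteins x samples intensity matrix with many missing values (NaN)
X = pd.read_csv("phosphoproteome.csv", index_col=0)  # placeholder file name

# Pairwise Spearman correlation between proteins, using only samples observed in both rows
corr = X.T.corr(method="spearman")
diss = (1 - corr).fillna(1.0)  # treat pairs with no overlap as maximally dissimilar

# 2-D t-SNE embedding from the precomputed dissimilarity matrix
emb = TSNE(metric="precomputed", init="random", random_state=0).fit_transform(diss.values)

# Minimum spanning tree over the embedded points to expose cluster structure
mst = minimum_spanning_tree(squareform(pdist(emb)))
```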
This intermediate-level data set was extracted from the census bureau database. There are 48,842 instances in the data set, a mix of continuous and discrete attributes (train = 32,561, test = 16,281).
The data set has 15 attributes, which include age, sex, education level and other relevant details of a person. The data set will help to improve your skills in Exploratory Data Analysis, Data Wrangling, Data Visualization and Classification Models.
Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The following description gives more details on this data set:
- age: The age of an individual.
- workclass: The type of work or employment of an individual. It can have the following categories: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked.
- fnlwgt (Final Weight): The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by the Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used.
People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
- education: The highest level of education completed.
- education-num: The number of years of education completed.
- marital-status: The marital status of an individual.
- occupation: The type of work performed by an individual.
- relationship: The relationship status.
- race: The race of an individual.
- sex: The gender of an individual.
- capital-gain: The amount of capital gain (financial profit).
- capital-loss: The amount of capital loss an individual has incurred.
- hours-per-week: The number of hours worked per week.
- native-country: The country of origin or the native country.
- income: The income level of an individual; serves as the target variable. It indicates whether income is greater than $50,000 or less than or equal to $50,000, denoted as (>50K, <=50K).
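A quick-start sketch for a first classification model; it assumes a CSV with a header row and "?" as the missing-value marker, so adjust to the files as distributed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes a header row named like the attributes above; if the file ships without
# headers, pass names=[...] to read_csv instead.
df = pd.read_csv("adult.csv", na_values="?", skipinitialspace=True)

X = pd.get_dummies(df.drop(columns="income"))          # one-hot encode categoricals
y = (df["income"].str.strip() == ">50K").astype(int)   # binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```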
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2. Example of transformed metadata: In this .xlsx (MS Excel) file, we list all the output metadata categories generated for each sample from the transformation of the 1KGP input datasets. The output metadata include information collected from all the four 1KGP metadata files considered. Some categories are not reported in the source metadata files—they are identified by the label manually_curated_...—and were added by the developed pipeline to store technical details (e.g., download date, the md5 hash of the source file, file size, etc.) and information derived from the knowledge of the source, such as the species, the processing pipeline used in the source and the health status. For every information category, the table reports a possible value. The third column (cardinality > 1) tells whether the same key can appear multiple times in the output GDM metadata file. This is used to represent multi-valued metadata categories; for example, in a GDM metadata file, the key manually_curated_chromosome appears once for every chromosome mutated by the variants of the sample.
As a final project for Data Wrangling this fall (2024), we were tasked with using our new skills in collecting and importing data via web scraping, online API queries, and file import to create a relational data set of 3 tables, with 2 related. We also had to use our tidying skills to clean and transform the imported data to prepare it for visualization and analysis, focusing on column types and names, categorical variables, etc.
An example notebook analyzing this information is provided, with 5 examples of analysis using mutating joins, tidying, and/or ggplot.
Design a web portal to automate the various operations performed in machine learning projects to solve specific problems related to supervised or unsupervised use cases. The web portal must have the capabilities to perform the below-mentioned tasks:
1. Extract Transform Load:
   a. Extract: The portal should provide the capability to configure any data source, for example cloud storage (AWS, Azure, GCP), databases (RDBMS, NoSQL), and real-time streaming data, to extract data into the portal. (Allow feasibility to write a custom script if required to connect to any data source to extract data.)
   b. Transform: The portal should provide various inbuilt functions/components to apply a rich set of transformations to transform extracted data into the desired format.
   c. Load: The portal should be able to save data into any of the cloud storages after the extracted data is transformed into the desired format.
   d. Allow the user to write a custom script in Python if some functionality is not present in the portal.
2. Exploratory Data Analysis: The portal should allow users to perform exploratory data analysis.
3. Data Preparation: Data wrangling, feature extraction and feature selection should be automated with minimal user intervention.
4. The application must suggest the machine learning algorithm best suited for the use case and perform a best-model search operation to automate model development.
5. The application should provide a feature to deploy the model in any of the clouds, and the application should create a prediction API to predict new instances.
6. The application should log each and every detail so that each activity can be audited in the future to investigate any event.
7. A detailed report should be generated for ETL, EDA, data preparation, and model development and deployment.
8. Create a dashboard to monitor model performance and create various alert mechanisms to notify the appropriate user to take necessary precautions.
9. Create functionality to implement retraining for an existing model if necessary.
10. The portal must be designed in such a way that it can be used by multiple organizations/users, where each organization/user is isolated from the others.
11. The portal should provide functionality to manage users, similar to the RBAC concept used in the cloud. (It is not necessary to build many roles, but design it in such a way that roles can be added in the future so that newly created roles can also be applied to users.) An organization can have multiple users and each user will have a specific role.
12. The portal should have a scheduler to schedule training or prediction tasks, and appropriate alerts regarding scheduled jobs should be sent to the subscriber/configured email id.
13. Implement watcher functionality to perform prediction as soon as a file arrives at the input location.
You have to build a solution that should summarize the various news articles from different reading categories.
Code: You are supposed to write code in a modular fashion.
- Safe: It can be used without causing harm.
- Testable: It can be tested at the code level.
- Maintainable: It can be maintained, even as your codebase grows.
- Portable: It works the same in every environment (operating system).
You have to maintain your code on GitHub and keep your GitHub repo public so that anyone can check your code. Maintain a proper readme file for the project; you should include the basic workflow and execution of the entire project in the readme file on GitHub. Follow the coding standards: https://www.python.org/dev/peps/pep-0008/
You can use any database (RDBMS or NoSQL) or use multiple databases.
You can use any cloud platform for this entire solution hosting like AWS, Azure or GCP.
Logging is a must for every action performed by your code; use the Python logging library for this.
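A minimal sketch of the kind of logging this implies (the logger names, log file path, and messages are placeholders, not part of the assignment):

```python
import logging

# Configure once at application start-up; every module then gets its own named logger.
logging.basicConfig(
    filename="portal.log",                      # placeholder log destination
    level=logging.INFO,
    format="%(asctime)s | %(name)s | %(levelname)s | %(message)s",
)

logger = logging.getLogger("etl.extract")       # hypothetical component name
logger.info("Started extraction from source %s", "s3://bucket/raw/")  # placeholder source
logger.error("Connection failed, retrying")     # errors are captured for later audit
```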
Use a source version control tool to implement a CI/CD pipeline, e.g. Azure DevOps, GitHub, Circle CI.
You can host your application on a cloud platform using an automated CI/CD pipeline.
You have to submit complete solution design strate...
License: https://www.usa.gov/government-works/
Originally, I was planning to use the Python Quandl API to get the data from here because it is already conveniently in time-series format. However, the data is split by reporting agency, which makes it difficult to get an accurate picture of the true short ratio because of missing data and difficulty in aggregation. So, I clicked on the source link, which turned out to be a gold mine because of their consolidated data. The only downside was that it was all in .txt format, so I had to use regex to parse it and data scraping to get the information from the website, but that was a good refresher 😄.
For better understanding of what the values in the text file mean, you can read this pdf from FINRA: https://www.finra.org/sites/default/files/2020-12/short-sale-volume-user-guide.pdf
I condensed all the individual text files into a single .txt file such that it's much faster and less complex to write code compared to having to iterate through each individual .txt file. I created several functions for this dataset so please check out my workbook "FINRA Short Ratio functions" where I have described step by step on how I gathered the data and formatted it so that you can understand and modify them to fit your needs. Note that the data is only for the range of 1st April 2020 onwards (20200401 to 20210312 as of gathering the data) and the contents are separated by | delimiters so I used \D (non-digit) in regex to avoid confusion with the (a|b) pattern syntax.
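If you just want to get started without the regex route, the pipe-delimited file can also be read directly with pandas; the file name below is a placeholder and the short-ratio columns are assumed to follow FINRA's user guide linked above, so check your file's header:

```python
import pandas as pd

# The consolidated file uses '|' as the field delimiter
short_vol = pd.read_csv("finra_short_volume_combined.txt", sep="|")

# e.g., a daily short ratio per symbol, assuming ShortVolume/TotalVolume columns exist
short_vol["short_ratio"] = short_vol["ShortVolume"] / short_vol["TotalVolume"]
print(short_vol.head())
```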
If you need historical data before April 2020, you can use the quandl database but it has non-consolidated information and you have to make a reference call for each individual stock for each agency so you would need to manually input tickers or get a list of all tickers through regex of the txt files or something like that 😅.
An excellent task to combine regular expressions (regex), web scraping, plotting, and data wrangling... see my notebook for an example with annotated workflow. Please comment and feel free to fork and modify my workbook to change the functionality. Possibly the short volumes can be combined with p/b ratios or price data to see the correlation --> can use seaborn pairgrid to visualise this for multiple stocks?
License: https://creativecommons.org/publicdomain/zero/1.0/
These datasets are scraped from basketball-reference.com and include all NBA games from the 1996-97 season to the 2020-21 season. My goal in this project is to create datasets that can be used by beginners to practice basic data science skills, such as data wrangling and cleaning, in a setting where it is easy to go back to the raw data to understand surprising results. For example, outliers can be difficult to understand when working with a taxi dataset, whereas the NBA has a large community of reporters, experts and game videos that may help you understand what is going on with the data.
The web scrapers used to collect the data can be found at: https://github.com/PatrickH1994/nba_webscrapes
The dataset will include all the information available at basketball-reference.com once the project is done.
Current files: - Games - Play-by-play - Player stats - Salary data
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information from the NFL Combine (2009 to 2019), including the results from sports performance tests and draft outcomes.
As sports statistics are in the public domain, this database was freely downloaded from https://www.pro-football-reference.com/
I appreciate the efforts of https://www.pro-football-reference.com/ in collating and hosting sports related data, and Kaggle for providing a platform for sharing datasets and knowledge.
This dataset is useful for beginners and intermediate users, who can practice visualisations, analytics, imputation, data cleaning/wrangling, and classification modelling. For example: What are the variables of importance in predicting round pick or draft status? Which school has the highest number of players being drafted into the NFL? What position type or player type is most represented at the NFL Combine? Do drafted and undrafted players perform differently on performance tests?
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🧪 Covid-19 Clinical Trials Dataset (Raw + Cleaned)
This dataset offers a deep look into the global clinical research landscape during the Covid-19 pandemic. Sourced directly from ClinicalTrials.gov, it provides structured and semi-structured information on registered Covid-19-related clinical trials across countries, sponsors, and phases.
📁 What’s Included • COVID_clinical_trials.csv — Raw dataset as obtained from ClinicalTrials.gov • Covid-19_cleaned_dataset.csv — Preprocessed version for direct use in data analysis and visualization tasks
🎯 Use Case & Learning Goals
This dataset is ideal for: • Practicing data cleaning, preprocessing, and wrangling • Performing exploratory data analysis (EDA) • Building interactive dashboards (e.g., with Tableau or Plotly) • Training ML models for classification or forecasting (e.g., predicting trial outcomes) • Exploring trends in clinical trial research during global health emergencies
🔍 Key Features
Each row represents a registered clinical trial and includes fields such as: • NCT Number (unique ID) • Study Title • Start Date and Completion Date • Phase • Study Type (Interventional/Observational) • Enrollment Size • Country, Sponsor, and Intervention Type • Study Status (Recruiting, Completed, Withdrawn, etc.)
✅ Cleaned Dataset
The cleaned version includes: • Standardized column naming • Filled missing values where possible • Removed duplicates and a few columns
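A sketch of what that preprocessing could look like in pandas (the raw file name comes from the list above; the specific columns dropped and fill rules are illustrative assumptions):

```python
import pandas as pd

raw = pd.read_csv("COVID_clinical_trials.csv")

# Standardize column names: strip, lowercase, underscores instead of spaces
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop exact duplicates and columns not needed for analysis (illustrative choice)
cleaned = raw.drop_duplicates()
cleaned = cleaned.drop(columns=["url"], errors="ignore")

# Fill missing values where a sensible default exists (column name assumed)
if "phases" in cleaned.columns:
    cleaned["phases"] = cleaned["phases"].fillna("Not Applicable")

cleaned.to_csv("Covid-19_cleaned_dataset.csv", index=False)
```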
📊 Example Applications • Country-wise contribution analysis • Sponsor landscape visualization • Trial timeline and phase progression charts • Predictive modeling of trial duration or status
🙏 Acknowledgments
Thanks to ClinicalTrials.gov for providing public access to this critical data.
https://www.reddit.com/wiki/api
Overview
The dataset was downloaded as a CSV file containing 1M posts from the r/Jokes subreddit. Of the relevant features, the "title" is the post's title, i.e. the joke's setup. The "selftext" is the punchline, or what you see once a user clicks on the post's content. It's worth noting that many jokes in this data table don't meet this criterion (NaNs).
Score
The "score" value describes the number of upvotes, i.e. the number of positive ratings the post received. Posts can additionally be downvoted, and while Reddit allows for negative values, the minimum value in the dataset is zero. When a user posts something to Reddit, however, they are automatically given a single upvote, so I am making the assumption that values of zero in this dataset were downvoted.
Exploratory Data Analysis - Try to understand intuitively "what makes a joke funny" using simple exploratory data analysis.
Funny / Not Funny - Classification - The ultimate goal in wrangling these data is to create a dataset for classifying jokes as either funny or not funny using the upvotes (a labeling sketch follows this list).
Jokes Generation - Train and generate jokes using a language generation model (GPT for example).
Funny Jokes Generation - Training and generating jokes using language models is one thing but generating Funny jokes using language models is a completely different task! (which is much much harder to do)
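One possible way to derive a binary label from the score; the file name is a placeholder and the threshold is arbitrary, while the column names come from the overview above:

```python
import pandas as pd

jokes = pd.read_csv("reddit_jokes.csv")   # placeholder file name

# Drop posts without a punchline (NaN selftext), per the note in the overview
jokes = jokes.dropna(subset=["selftext"])

# Arbitrary threshold: call a joke "funny" if it collected more than 10 upvotes
jokes["funny"] = (jokes["score"] > 10).astype(int)
print(jokes["funny"].value_counts())
```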
Are you up for a challenge? ;)
oslo-city-bike License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because there is also similar data for Trondheim and Bergen.
Variables (from oslobysykkel.no):

| Variable | Format | Description |
|---|---|---|
| started_at | Timestamp | Timestamp of when the trip started |
| ended_at | Timestamp | Timestamp of when the trip ended |
| duration | Integer | Duration of trip in seconds |
| start_station_id | String | Unique ID for start station |
| start_station_name | String | Name of start station |
| start_station_description | String | Description of where start station is located |
| start_station_latitude | Decimal degrees in WGS84 | Latitude of start station |
| start_station_longitude | Decimal degrees in WGS84 | Longitude of start station |
| end_station_id | String | Unique ID for end station |
| end_station_name | String | Name of end station |
| end_station_description | String | Description of where end station is located |
| end_station_latitude | Decimal degrees in WGS84 | Latitude of end station |
| end_station_longitude | Decimal degrees in WGS84 | Longitude of end station |
Please note: this data and my analysis focuses on the new data format, but historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern.
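Loading and stacking the monthly files might look like this (the glob pattern follows the naming scheme above; folder layout is an assumption):

```python
import glob
import pandas as pd

# All monthly files in the oslobysykkel folder, e.g. oslobysykkel-2023-07.csv
files = sorted(glob.glob("oslobysykkel/oslobysykkel-*.csv"))
trips = pd.concat(
    (pd.read_csv(f, parse_dates=["started_at", "ended_at"]) for f in files),
    ignore_index=True,
)

# Simple sanity checks: trips per month and median duration in minutes
print(trips.set_index("started_at").resample("MS").size())
print(trips["duration"].median() / 60)
```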
I myself was extremely fascinated by this open data from Oslo City Bike, and in the process of deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytical problem book, or even a platform, called 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), and so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis** https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project — stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks:
- Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
- Exploring Open Data from Oslo City Bike, by Jon Olave — visualization of popular routes and seasonality analysis.
- Oslo City Bike Data Wrangling, by Karl Tryggvason — predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
- Helsinki City Bikes: Exploratory Data Analysis — analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did it for weather data: integrating temperature, precipitation, and wind speed to explain variations in daily demand. https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that is where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
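As an illustration of that kind of enrichment, daily trip counts could be joined with a daily weather table; the weather CSV and its column names are placeholders (meteostat-style exports), not files shipped with this dataset:

```python
import pandas as pd

trips = pd.read_csv("oslobysykkel/oslobysykkel-2024-06.csv", parse_dates=["started_at"])
weather = pd.read_csv("oslo_weather_daily.csv")  # placeholder export with a 'date' column (YYYY-MM-DD)

# Daily demand: number of trips started per calendar day
daily = (
    trips.assign(date=trips["started_at"].dt.strftime("%Y-%m-%d"))
         .groupby("date").size().rename("trips").reset_index()
)

joined = weather.merge(daily, on="date", how="inner")
print(joined[["trips", "tavg", "prcp", "wspd"]].corr())  # weather column names assumed
```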
Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series
License: https://creativecommons.org/publicdomain/zero/1.0/
In 2006, global concern was raised over the rapid decline in the honeybee population, an integral component of American honey agriculture. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon in which disappearing worker bees cause the remaining hive colony to collapse. Speculation as to the cause of this disorder points to hive diseases and pesticides harming the pollinators, though no overall consensus has been reached. Twelve years later, some industries are observing recovery, but the American honey industry is still largely struggling. The U.S. used to locally produce over half the honey it consumes per year. Now, honey mostly comes from overseas, with 350 of the 400 million pounds of honey consumed every year originating from imports. This dataset provides insight into honey production supply and demand in America by state from 1998 to 2012.
The National Agricultural Statistics Service (NASS) is the primary data reporting body for the US Department of Agriculture (USDA). NASS's mission is to "provide timely, accurate, and useful statistics in service to U.S. agriculture". From datasets to census surveys, their data covers virtually all aspects of U.S. agriculture. Honey production is one of the datasets offered; the original page containing the data, along with related datasets such as Honey Bee Colonies and Cost of Pollination, is available on the NASS site. Data wrangling was performed in order to clean the dataset. honeyproduction.csv is the final tidy dataset suitable for analysis. The three other datasets (which include "honeyraw" in the title) are the original raw data downloaded from the site. They are uploaded to this page along with the "**Wrangling The Honey Production Dataset**" kernel as an example to show users how data can be wrangled into a cleaner format. Useful metadata on certain variables of the honeyproduction dataset is provided below:
Honey production data was published by the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture. The beautiful banner photo was by Eric Ward on Unsplash.