https://www.technavio.com/content/privacy-notice
Data Wrangling Market Size 2024-2028
The data wrangling market size is forecast to increase by USD 1.4 billion, at a CAGR of 14.8%, between 2023 and 2028. The market is experiencing significant growth due to the numerous benefits provided by data wrangling solutions, including data cleaning, transformation, and enrichment. One major trend driving market growth is the rising need for technologies such as competitive intelligence and artificial intelligence in the healthcare sector, where data wrangling is essential for managing and analyzing patient data to improve patient outcomes and reduce costs. However, a challenge facing the market is the lack of awareness of data wrangling tools among small and medium-sized enterprises (SMEs), which limits their ability to effectively manage and utilize their data. Despite this, the market is expected to continue growing as more organizations recognize the value of data wrangling in driving business insights and decision-making.
What will be the Size of the Market During the Forecast Period?
The market is experiencing significant growth due to the increasing demand for data management and analysis across industries, driven by the growing volume, variety, and velocity of data generated from sources such as IoT devices, financial services, and smart cities. Artificial intelligence and machine learning technologies are increasingly used for data preparation, data cleaning, and data unification. Data wrangling, also known as data munging, is the process of cleaning, transforming, and enriching raw data to make it usable for analysis. This process is crucial for businesses aiming to gain valuable insights from their data and make informed decisions. Data analytics is a primary driver for the market, as organizations seek to extract meaningful insights from their data. Cloud solutions are increasingly popular for data wrangling due to their flexibility, scalability, and cost-effectiveness.
Furthermore, both on-premises and cloud-based solutions are being adopted by businesses to meet their specific data management requirements. Multi-cloud strategies are also gaining traction in the market, as organizations seek to leverage the benefits of multiple cloud providers. This approach allows businesses to distribute their data across multiple clouds, ensuring business continuity and disaster recovery capabilities. Data quality is another critical factor driving the market. Ensuring data accuracy, completeness, and consistency is essential for businesses to make reliable decisions. The market is expected to grow further as organizations continue to invest in big data initiatives and implement advanced technologies such as AI and ML to gain a competitive edge. Data cleaning and data unification are key processes in data wrangling that help improve data quality. The finance and insurance industries are major contributors to the market, as they generate vast amounts of data daily.
In addition, real-time analysis is becoming increasingly important in these industries, as businesses seek to gain insights from their data in near real-time to make informed decisions. The Internet of Things (IoT) is also driving the market, as businesses seek to collect and analyze data from IoT devices to gain insights into their operations and customer behavior. Edge computing is becoming increasingly popular for processing IoT data, as it allows for faster analysis and decision-making. Self-service data preparation is another trend in the market, as businesses seek to empower their business users to prepare their data for analysis without relying on IT departments.
Moreover, this approach allows businesses to be more agile and responsive to changing business requirements. Big data is another significant trend in the market, as businesses seek to manage and analyze large volumes of data to gain insights into their operations and customer behavior. Data wrangling is a critical process in managing big data, as it ensures that the data is clean, transformed, and enriched to make it usable for analysis. In conclusion, the market in North America is experiencing significant growth due to the increasing demand for data management and analysis in various industries. Cloud solutions, multi-cloud strategies, data quality, finance and insurance, IoT, real-time analysis, self-service data preparation, and big data are some of the key trends driving the market. Businesses that invest in data wrangling solutions can gain a competitive edge by gaining valuable insights from their data and making informed decisions.
Market Segmentation
The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.
Sector
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it was lacking county boundary shape data, per-county population, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytics, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (with the fips code column)
See the column descriptions for more details on the dataset
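For anyone reproducing a similar enrichment, the sketch below shows the general pattern of joining the NYTimes county time series to a population table on the fips column and deriving per-day and per-capita metrics. It is only an illustration: the population file name and its column names are hypothetical, and this is not the exact wrangling code behind this dataset.

```python
import pandas as pd

# NYTimes county-level time series (date, county, state, fips, cases, deaths).
nyt = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
    dtype={"fips": "string"},
    parse_dates=["date"],
)

# Hypothetical census population table keyed by FIPS code.
pop = pd.read_csv("county_population_2019.csv", dtype={"fips": "string"})

# Join on FIPS, then derive per-day and per-capita metrics.
df = nyt.merge(pop[["fips", "population"]], on="fips", how="left")
df = df.sort_values(["fips", "date"])
df["new_cases"] = df.groupby("fips")["cases"].diff().fillna(df["cases"])
df["new_deaths"] = df.groupby("fips")["deaths"].diff().fillna(df["deaths"])
df["cases_per_100k"] = df["cases"] / df["population"] * 100_000
df["deaths_per_100k"] = df["deaths"] / df["population"] * 100_000
```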
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Detailed code for data wrangling (“Coral Protection Outcomes_Wrangle.Rmd”) as well as analysis and figure generation (“Coral Protection Outcomes_FinguresAnalysis.Rmd”). Outputs from the data wrangling step to be used in the analysis script are included in the “CoralProtection.Rdata” file. (ZIP)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.
This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.
This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3 GB in size.
Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.
shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).
sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109 MB in size.
solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.
vcf_clean_script.R: converts the vcf output in .solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.
combined_tidy_vcf.csv: output of vcf_clean_script.R
https://spdx.org/licenses/CC0-1.0.html
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.
Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.
Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.
The surveys administered for the fall 2018 and spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for:
- Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
- Exploring EDA techniques like visualizations and summary statistics.
- Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency: Parse Transaction Date values and correct or drop missing or invalid dates.
4. Feature Engineering: Add new columns, such as Day of the Week or Transaction Month, for further analysis.

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
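Returning to the cleaning steps listed above, here is a minimal pandas sketch that assumes the column names from the table; the fill and recomputation strategies shown are just one reasonable choice, not a prescribed solution.

```python
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Treat the sentinel strings as missing values.
df = df.replace(["ERROR", "UNKNOWN"], pd.NA)

# Coerce numeric and date columns; unparseable entries become NaN/NaT.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Fill missing values: median for numerics, "Unknown" for categoricals.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Recompute Total Spent from Quantity * Price Per Unit.
df["Total Spent"] = df["Quantity"] * df["Price Per Unit"]

# Feature engineering: day of week and transaction month.
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.to_period("M")
```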
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.
Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
Allard_Analysis_APPAM SBIR project: Forthcoming
Allard_Spatial Analysis: Forthcoming
Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983-2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.
Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.
Primary Sources:
Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports.
Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api.
Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.
This study examined how indirect fire effects (improved forage quality) affect the density of and offtake by grasshoppers at two different times since fire and in unburned plots. Data include total aboveground forage removal inside and outside grasshopper exclosures, crude protein content of aboveground plant material, and grasshopper densities throughout the study period. Both forage offtake and grasshopper density were significantly higher in burned plots compared to unburned plots. Burned plot grasshopper density increased over time, with greater rates of increase in recently burned plots, while density remained constant in unburned locations. These density and offtake patterns appear to be the result of higher crude protein content in burned plots, on account of them having a much higher proportion of recent growth after fire removed aboveground senesced material.
Resources in this dataset:
- Resource Title: Rangeland fire grasshopper forage data. File Name: RangelandFireGrasshopperForage.xlsx. Resource Description: Excel file containing plot-level data used in formal analysis.
- Resource Title: Script. File Name: ScriptSupplement.pdf. Resource Description: R script used for data wrangling and statistical analysis. Resource Software Recommended: R 4.0.3, url: https://cran.r-project.org
https://creativecommons.org/licenses/publicdomain/
This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.
Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub and described in a peer-reviewed paper, and; sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems improving long-term curation.
This RAPID award will produce and deliver a georeferenced, vetted and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses, a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data that are relevant to understanding emergent and other properties of the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
Files included in this resource
9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format
0067804-200613084148143.zip: Raw data from GBIF, DwC-A format
0067806-200613084148143.zip: Raw data from GBIF, DwC-A format
1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format
bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets that have attributions made containing Rhinolophids or Hipposiderids, each package also containing a CSV file for mismatches in person date of birth/death and specimen eventDate. File bionomia-datasets-attributions-key_2021-02-25.csv included in this directory provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.
bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.
flagEventDate.txt: file containing term definition to reference in DwC-A
flagExclude.txt: file containing term definition to reference in DwC-A
flagGeoreference.txt: file containing term definition to reference in DwC-A
flagTaxonomy.txt: file containing term definition to reference in DwC-A
georeferencedByID.txt: file containing term definition to reference in DwC-A
identifiedByNames.txt: file containing term definition to reference in DwC-A
instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers
RAPID-code_collection-date.R: code associated with enhancing collection dates
RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data
RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages
RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages
RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages
RAPID-code_people.R: code associated with enhancing data about people
RAPID-code_standardize-country.R: code associated with standardizing country data
RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format
RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format
rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv
rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format
rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project
rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized
RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates
RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data
RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages
RAPID-protocol_georeference.pdf: protocol associated with georeferencing
RAPID-protocol_people.pdf: protocol associated with enhancing data about people
RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data
RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data
RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
recordedByNames.txt: file containing term definition to reference in DwC-A
Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource
Datasets can be opened as Excel sheets or Google Sheets. Code can be opened in R.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The Lahman Baseball Database is a comprehensive, open-source compilation of statistics and player data for Major League Baseball (MLB). It contains relational data from the 19th century through the most recent complete season, including batting, pitching, and fielding statistics, player demographics, awards, team performance, and managerial records.
This dataset is widely used for exploratory data analysis, statistical modeling, predictive analysis, machine learning, and sports performance forecasting.
This dataset is the latest CSV release of the Lahman Baseball Database, downloaded directly from https://sabr.org/lahman-database/. It includes historical MLB data spanning from 1871 to 2024, organized across 27 structured tables such as: - Batting: Player-level batting stats per year - Pitching: Season-level metrics - People: Biographical data (birth/death, handedness, debut/finalGame) - Teams, Managers: Team records - BattingPost, PitchingPost, FieldingPost: Post-season stats - AllstarFull: all star game - statsHallOfFame: Historical awards and recognitions
Items to explore: - Track league-wide trends in home runs, strikeouts, or batting averages over time - Compare player performance by era, position, or righty/lefty - Create a timeline showing changes in a team's win-loss record - Map birthplace distributions of MLB players over time - Estimate the impact of rule changes on player stats (pitch clock, DH) - Model factors that influence MVP or Cy Young award wins - Predict a player's future performance based on historical stats
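As a quick start on the first idea, here is a small pandas sketch of league-wide home-run trends; it assumes the Batting table from the CSV release has been saved locally as Batting.csv with the standard yearID and HR columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Batting table from the Lahman CSV release (one row per player-season stint).
batting = pd.read_csv("Batting.csv")

# Total home runs across the league, per season.
hr_by_year = batting.groupby("yearID")["HR"].sum()

hr_by_year.plot(title="League-wide home runs per season")
plt.xlabel("Season")
plt.ylabel("Home runs")
plt.show()
```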
📘 License
This dataset is released under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license. Attribution is required. Derivative works must be shared under the same license.
📝 Official source: https://sabr.org/lahman-database/ 📥 Direct data page: https://www.seanlahman.com/baseball-archive/statistics/ 🖊️ R-Package Documentation: https://cran.r-project.org/web/packages/Lahman/Lahman.pdf
0.1 Copyright Notice & Limited Use License: This database is copyright 1996-2025 by SABR, via a generous donation from Sean Lahman. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/ For licensing information or further information, contact Scott Bush at: sbush@sabr.org
0.2 Contact Information: Web site: https://sabr.org/lahman-database/ E-Mail: jpomrenke@sabr.org
https://creativecommons.org/publicdomain/zero/1.0/
Bike Share Data
Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles on a very short-term basis for a price. This allows people to borrow a bike from point A and return it at point B, though they can also return it to the same location if they'd like to just go for a ride. Regardless, each bike can serve several users per day.
Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.
In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare the system usage between three large cities: Chicago, New York City, and Washington, DC.
The Datasets
Randomly selected data for the first six months of 2017 are provided for all three cities. All three of the data files contain the same core six (6) columns:
- Start Time (e.g., 2017-01-01 00:07:57)
- End Time (e.g., 2017-01-01 00:20:53)
- Trip Duration (in seconds - e.g., 776)
- Start Station (e.g., Broadway & Barry Ave)
- End Station (e.g., Sedgwick St & North Ave)
- User Type (Subscriber or Customer)

The Chicago and New York City files also have the following two columns:
- Gender
- Birth Year
Data for the first 10 rides in the new_york_city.csv file
The original files are much larger and messier, and you don't need to download them, but they can be accessed here if you'd like to see them (Chicago, New York City, Washington). These files had more columns and they differed in format in many cases. Some data wrangling has been performed to condense these files to the above core six columns to make your analysis and the evaluation of your Python skills more straightforward. In the Data Wrangling course that comes later in the Data Analyst Nanodegree program, students learn how to wrangle the dirtiest, messiest datasets, so don't worry, you won't miss out on learning this important skill!
Statistics Computed
You will learn about bike share use in Chicago, New York City, and Washington by computing a variety of descriptive statistics. In this project, you'll write code to provide the following information:
- most common month
- most common day of week
- most common hour of day
- most common start station
- most common end station
- most common trip from start to end (i.e., most frequent combination of start station and end station)
- total travel time
- average travel time
- counts of each user type
- counts of each gender (only available for NYC and Chicago)
- earliest, most recent, and most common year of birth (only available for NYC and Chicago)

The Files
To answer these questions using Python, you will need to write a Python script. To help guide your work in this project, a template with helper code and comments is provided in a bikeshare.py file, and you will do your scripting in there also. You will need the three city dataset files too:
- chicago.csv
- new_york_city.csv
- washington.csv
All four of these files are zipped up in the Bikeshare file in the resource tab in the sidebar on the left side of this page. You may download and open up that zip file to do your project work on your local machine.
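As a rough sketch of the kinds of statistics listed above, the snippet below computes a few of them with pandas, using the column names described earlier; the provided bikeshare.py template structures this differently, so treat it only as an illustration.

```python
import pandas as pd

df = pd.read_csv("chicago.csv", parse_dates=["Start Time", "End Time"])

# Popular times of travel.
most_common_month = df["Start Time"].dt.month_name().mode()[0]
most_common_day = df["Start Time"].dt.day_name().mode()[0]
most_common_hour = df["Start Time"].dt.hour.mode()[0]

# Popular stations and trip.
most_common_start = df["Start Station"].mode()[0]
most_common_end = df["End Station"].mode()[0]
most_common_trip = (df["Start Station"] + " -> " + df["End Station"]).mode()[0]

# Trip duration (seconds in the source data).
total_travel_time = df["Trip Duration"].sum()
average_travel_time = df["Trip Duration"].mean()

# User info.
user_type_counts = df["User Type"].value_counts()

print(most_common_month, most_common_day, most_common_hour)
print(most_common_start, most_common_end, most_common_trip)
print(total_travel_time, average_travel_time)
print(user_type_counts)
```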
Lake trophic status is a key water quality property that integrates a lake's physical, chemical, and biological processes. Despite the importance of trophic status as a gauge of lake water quality, standardized and machine readable observations are uncommon. Remote sensing presents an opportunity to detect and analyze lake trophic status with reproducible, robust methods across time and space. We used Landsat surface reflectance and lake morphometric data to create the first compendium of lake trophic status for more than 56,000 lakes of at least 10 ha in size throughout the contiguous United States from 1984 through 2020. The dataset was constructed with FAIR data principles (Findable, Accessible, Interoperable, and Reproducible) in mind, where data are publicly available, relational keys from parent datasets are retained, and all data wrangling and modeling routines are scripted for future reuse. Together, this resource offers critical data to address basic and applied research questions about lake water quality at a suite of spatial and temporal scales.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data set contains measles vaccination rate data for 46,412 schools in 32 states across the US.
Vaccination rates are for the 2017-2018 school year for the following states: - Colorado - Connecticut - Minnesota - Montana - New Jersey - New York - North Dakota - Pennsylvania - South Dakota - Utah - Washington
Rates for other states are for the time period 2018-2019.
The data was originally compiled by The Wall Street Journal, and then downloaded and wrangled by the TidyTuesday community. The R code used for wrangling can be accessed here.
Please remember that you are welcome to explore beyond the provided data set, but the data is provided as a "toy" data set to practice techniques on. The data may require additional cleaning and wrangling!
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains all the data and R scripts used to produce all analyses and figures for Kaplanis, Denny, and Raimondi 2024, as well as all intermediate outputs and final figures. To access this content, download and unzip the intertidalvertdist folder (for intertidal vertical distribution). The R Project is titled "intertidalvertdist". All pertinent information needed to access data, replicate the analyses, and produce figures is contained within the README file, but a brief description is below.
Directory Architecture:
Data: Contains all data. Within this folder are two subdirectories: Raw Data and Processed Data. Raw Data are unmanipulated, straight from the data source. Processed Data are outputs from scripted data wrangling and transformations.
Within each of these folders are two more subdirectories: Tide Gauge Data, and MARINe Data. These are the two data sources used in this manuscript - monthly sea-level data from The National Oceanic and Atmospheric Administration Center for Operational Oceanographic Products and Services (NOAA CO-OPS) tide gauge stations, and long-term rocky intertidal biological monitoring data from Multi-Agency Rocky Intertidal Network (MARINe) survey sites.
Scripts: All R scripts are contained within the Scripts folder. The scripts have the prefix IVD (for intertidal vertical distribution), then a name that indicates the major function of the code. Each script either downloads data, manipulates data, conducts analyses, and/or produces a figure.
Outputs: Any figures and tables from preliminary analyses that are not used in the final manuscript are saved in Outputs.
Figures: All final figures and tables are contained in the Figures folder. All figures are produced by scripts, except Figs. 1 and 2, which are schematics produced manually in a graphics editor. This folder contains two other folders: Supplementary Figures and Partial Regression Plots. Partial Regression Plots are the same as the final Figures 8-12, except they are grouped by taxa rather than by explanatory variable.
Data Processing Workflow - Overview: Tide Gauge Data (Data/Raw Data/Tide Gauge Data/individual stations) were downloaded using the NOAA Co-Ops API URL Builder (https://tidesandcurrents.noaa.gov/api-helper/url-generator.html), merged, then analyzed. Three MARINe data sets from the Coastal Biodiversity Survey (CBS) were accessed via data requests (https://marine.ucsc.edu/explore-the-data/contact/data-request-form.html). The first MARINe dataset (Data/Raw Data/MARINe Data/CBS_Percent Cover Data, both First Sample and Full Sample) was used to determine the top ten most abundant taxa (hereafter termed “dominant taxa”) across CBS survey sites during the monitoring period of 2001-01-01 to 2021-09-30. The second MARINe dataset (Data/Raw Data/MARINe Data/CBS_Elevation Data) was used to describe the upper limits of vertical distribution of dominant taxa through time. The third MARINe dataset (Data/Raw Data/MARINe Data/CBS_Presence Data) was used to visualize latitudinal distribution of taxa.
Location information for Tide Gauge Stations and CBS Survey Sites were assembled into a table (Data/Raw Data/CBS_Tide Gauge_Data.csv)
Tide Gauge Data were processed first, then MARINe Data. To replicate this workflow follow the steps described in the README file, in order.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The interpretation of biological data sets is essential for generating hypotheses that guide research, yet modern methods of global analysis challenge our ability to discern meaningful patterns and then convey results in a way that can be easily appreciated. Proteomic data is especially challenging because mass spectrometry detectors often miss peptides in complex samples, resulting in sparsely populated data sets. Using the R programming language and techniques from the field of pattern recognition, we have devised methods to resolve and evaluate clusters of proteins related by their pattern of expression in different samples in proteomic data sets. We examined tyrosine phosphoproteomic data from lung cancer samples. We calculated dissimilarities between the proteins based on Pearson or Spearman correlations and on Euclidean distances, whilst dealing with large amounts of missing data. The dissimilarities were then used as feature vectors in clustering and visualization algorithms. The quality of the clusterings and visualizations were evaluated internally based on the primary data and externally based on gene ontology and protein interaction networks. The results show that t-distributed stochastic neighbor embedding (t-SNE) followed by minimum spanning tree methods groups sparse proteomic data into meaningful clusters more effectively than other methods such as k-means and classical multidimensional scaling. Furthermore, our results show that using a combination of Spearman correlation and Euclidean distance as a dissimilarity representation increases the resolution of clusters. Our analyses show that many clusters contain one or more tyrosine kinases and include known effectors as well as proteins with no known interactions. Visualizing these clusters as networks elucidated previously unknown tyrosine kinase signal transduction pathways that drive cancer. Our approach can be applied to other data types, and can be easily adopted because open source software packages are employed.
https://creativecommons.org/publicdomain/zero/1.0/
Columns include: New case, New case (7 day rolling average), Recovered, Active case, Local cases, Imported case, ICU, Death, Cumulative deaths, People tested, Cumulative people tested, Positivity rate, and Positivity rate (7 day rolling average).
Columns 1 to 22 are Twitter data; the Tweets are retrieved from Health DG @DGHisham's timeline with the Twitter API. A typical covid situation update Tweet is written in a relatively fixed format. Data wrangling is done in Python/Pandas, with numerical values extracted using Regular Expressions (RegEx). Missing data are added manually from Desk of DG (kpkesihatan).
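The snippet below is only a rough illustration of that approach (regex extraction over tweet text with pandas); the tweet wording shown is invented for the example and is not the actual @DGHisham format, so the label strings would need to match the real tweets.

```python
import re
import pandas as pd

# Made-up tweet text; the real updates follow their own fixed phrasing.
tweets = pd.Series([
    "Situasi terkini: 1,234 kes baharu, 987 kes sembuh, 5 kematian.",
])

def extract_number(text, label):
    """Pull the integer that appears immediately before a given label."""
    match = re.search(r"([\d,]+)\s+" + re.escape(label), text)
    return int(match.group(1).replace(",", "")) if match else None

df = pd.DataFrame({
    "new_cases": tweets.apply(extract_number, label="kes baharu"),
    "recovered": tweets.apply(extract_number, label="kes sembuh"),
    "deaths": tweets.apply(extract_number, label="kematian"),
})
print(df)
```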
Column 23 ['remark'] is my own written remark regarding the Tweet status/content.
Column 24 ['Cumulative people tested'] data is transcribed from an image on MOH COVID-19 website. Specifically, the first image under TABURAN KES section in each Situasi Terkini daily webpage of http://covid-19.moh.gov.my/terkini. If missing, the image from CPRC KKM Telegram or KKM Facebook Live video is used. Data in this column, dated from 1 March 2020 to 11 Feb 2021, are from Our World in Data, their data collection method as stated here.
MOH does not publish any covid data in CSV/Excel format as of today; they provide the data as is, along with infographics that are hardly informative. In an undisclosed email, MOH did not seem to understand my request for them to release the covid public health data for anyone to download and analyze as they wish.
A simple visualization dashboard is now published on Tableau Public. It is updated daily. Do check it out! More charts will be added in the near future.
Create better visualizations to help fellow Malaysians understand the Covid-19 situation. Empower the data science community.
The following data shows riding information for members vs. casual riders at the company Cyclistic (a made-up name). This is a dataset used as a case study for the Google Data Analytics certificate.
The Changes Done to the Data in Excel: - Removed all duplicates (none were found) - Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2" and then formatting that column as Time (37:30:55) - Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1 = Sunday through 7 = Saturday - Some cells display as ########; that data was left unchanged, as it simply represents negative values and should be treated as 0.
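For readers working outside Excel and R, the same two derived columns can be sketched in pandas; the file name is hypothetical, and started_at / ended_at are the column names referenced above.

```python
import pandas as pd

# Hypothetical trip export containing the started_at / ended_at columns.
rides = pd.read_csv("trips.csv", parse_dates=["started_at", "ended_at"])

# ride_length: ended_at minus started_at (equivalent of the Excel formula =C2-B2).
rides["ride_length"] = rides["ended_at"] - rides["started_at"]

# day_of_week: 1 = Sunday through 7 = Saturday, matching =WEEKDAY(B2,1).
# pandas dayofweek is 0 = Monday ... 6 = Sunday, so shift accordingly.
rides["day_of_week"] = (rides["started_at"].dt.dayofweek + 1) % 7 + 1

# Average ride length per weekday.
print(rides.groupby("day_of_week")["ride_length"].mean())
```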
Processing the Data in RStudio: - Installed required packages such as tidyverse for data import and wrangling, lubridate for date functions and ggplot for visualization. - Step 1: I read the csv files into R to collect the data - Step 2: Made sure the data all contained the same column names because I want to merge them into one - Step 3: Renamed all column names to make sure they align, then merged them into one combined data - Step 4: More data cleaning and analyzing - Step 5: Once my data was cleaned and clearly telling a story, I began to visualize it. The visualizations done can be seen below.
This intermediate-level data set was extracted from the Census Bureau database. There are 48,842 instances, a mix of continuous and discrete attributes (train = 32,561; test = 16,281).
The data set has 15 attributes, which include age, sex, education level, and other relevant details of a person. The data set will help to improve your skills in Exploratory Data Analysis, Data Wrangling, Data Visualization, and Classification Models.
Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The Following description gives more details on this data set:
age: The age of an individual.
workclass: The type of work or employment of an individual. It can have the following categories:
Final Weight: The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by the Population Division here at the Census Bureau. We use 3 sets of controls. These are: 1. A single cell estimate of the population 16+ for each state. 2. Controls for Hispanic Origin by age and sex. 3. Controls by Race, age and sex.
We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used.
People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
education: The highest level of education completed.
education-num: The number of years of education completed.
marital-status: The marital status.
occupation: Type of work performed by an individual.
relationship: The relationship status.
race: The race of an individual.
sex: The gender of an individual.
capital-gain: The amount of capital gain (financial profit).
capital-loss: The amount of capital loss an individual has incurred.
hours-per-week: The number of hours worked per week.
native-country: The country of origin or the native country.
income: The income level of an individual; serves as the target variable. It indicates whether the income is greater than $50,000 or less than or equal to $50,000, denoted as (>50K, <=50K).
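As one way to treat income as the target, the sketch below fits a simple classifier with pandas and scikit-learn; the file name adult.csv and the presence of a header row are assumptions (the raw UCI files ship without headers), so adjust the loading step to your copy of the data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical CSV with the columns described above, including the income target.
df = pd.read_csv("adult.csv")

X = df.drop(columns=["income"])
y = (df["income"].str.strip() == ">50K").astype(int)

numeric = X.select_dtypes(include="number").columns.tolist()
categorical = [c for c in X.columns if c not in numeric]

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```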
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a dataset that describes the changing perspectives on a woman's right to choose in America.
All data are official figures from Gallup, formerly known as the American Institute of Public Opinion, that have been compiled and structured by myself. I decided to create this dataset to explore how American views on abortion have evolved in the decades preceding the Dobbs V. Jackson Women's Health Organization decision. In that particular court case, the Supreme Court also overturned the 1973 Roe V. Wade decision, ending the long-standing constitutional right to abortion in the United States. Abortion has become a fixture of American politics in recent years, but few are able to take a bipartisan stance on the issue. I hope that the objectiveness of the quantitative data in this dataset can allow for a more rational understanding of the issue.
2023-02-04 - Dataset is created (17,474 days after the coverage start date).
GitHub Repository - The same data but on GitHub.
The idea for this dataset came from Ms. Katlen, my English teacher. A big thank you to her for the suggestion to explore how perspectives have changed about a woman's right to choose :)