100+ datasets found

Scooter Sales - Excel Project
kaggle.com
Updated Jun 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ann Truong (2023). Scooter Sales - Excel Project [Dataset]. https://www.kaggle.com/datasets/bvanntruong/scooter-sales-excel-project
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 8, 2023
Dataset provided by
Kaggle
Authors
Ann Truong
Description
The link for the Excel project to download can be found on GitHub here. It includes the raw data, Pivot Tables, and an interactive dashboard with Pivot Charts and Slicers. The project also includes business questions and the formulas I used to answer. The image below is included for ease. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12904052%2F61e460b5f6a1fa73cfaaa33aa8107bd5%2FBusinessQuestions.png?generation=1686190703261971&alt=media" alt=""> The link for the Tableau adjusted dashboard can be found here.

A screenshot of the interactive Excel dashboard is also included below for ease. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12904052%2Fe581f1fce8afc732f7823904da9e4cce%2FScooter%20Dashboard%20Image.png?generation=1686190815608343&alt=media" alt="">
Data from: Current and projected research data storage needs of Agricultural...
catalog.data.gov
agdatacommons.nal.usda.gov
+2more
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. https://catalog.data.gov/dataset/current-and-projected-research-data-storage-needs-of-agricultural-research-service-researc-f33da
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Description
The USDA Agricultural Research Service (ARS) recently established SCINet , which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to themselves collate responses from their unit before reporting in the survey. Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values. Resources in this dataset:Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdfResource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here. Resource Software Recommended: Adobe Acrobat,url: https://get.adobe.com/reader/ Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csvResource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is that same data as in the Excel spreadsheet (also provided).Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsxResource Description: MS Excel worksheet that Includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.Resource Software Recommended: Microsoft Excel,url: https://products.office.com/en-us/excel
HR Analytics Dataset
kaggle.com
zip
Updated Oct 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anshika2301 (2023). HR Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/anshika2301/hr-analytics-dataset
Explore at:
zip(213690 bytes)Available download formats
Dataset updated
Oct 27, 2023
Authors
anshika2301
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
HR analytics, also referred to as people analytics, workforce analytics, or talent analytics, involves gathering together, analyzing, and reporting HR data. It is the collection and application of talent data to improve critical talent and business outcomes. It enables your organization to measure the impact of a range of HR metrics on overall business performance and make decisions based on data. They are primarily responsible for interpreting and analyzing vast datasets.

Download the data CSV files here ; https://drive.google.com/drive/folders/18mQalCEyZypeV8TJeP3SME_R6qsCS2Og
Supply Chain DataSet
kaggle.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
Explore at:
zip(9340 bytes)Available download formats
Dataset updated
Jun 1, 2023
Authors
Amir Motefaker
Description
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
CSV file used in statistical analyses
data.csiro.au
researchdata.edu.au
+1more
Updated Oct 13, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Explore at:
Unique identifier
https://doi.org/10.4225/08/543B4B4CA92E6
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttp://www.csiro.au/
License
https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/
Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttp://www.csiro.au/
Description
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
18 excel spreadsheets by species and year giving reproduction and growth...
catalog.data.gov
data.wu.ac.at
Updated Aug 17, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2024). 18 excel spreadsheets by species and year giving reproduction and growth data. One excel spreadsheet of herbicide treatment chemistry. [Dataset]. https://catalog.data.gov/dataset/18-excel-spreadsheets-by-species-and-year-giving-reproduction-and-growth-data-one-excel-sp
Explore at:
Dataset updated
Aug 17, 2024
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
Excel spreadsheets by species (4 letter code is abbreviation for genus and species used in study, year 2010 or 2011 is year data collected, SH indicates data for Science Hub, date is date of file preparation). The data in a file are described in a read me file which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read me description of the column in the date set for chemical analysis. In this file one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk , D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee , and M. Plocher. Plant reproduction is altered by simulated herbicide drift toconstructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).
Treevill: N.B.G. Unique & Rare Raw Dataset
kaggle.com
zip
Updated Jan 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shuvo Kumar Basak-4004 (2025). Treevill: N.B.G. Unique & Rare Raw Dataset [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/treevill-n-b-g-unique-and-rare-raw-dataset
Explore at:
zip(2423469043 bytes)Available download formats
Dataset updated
Jan 26, 2025
Authors
Shuvo Kumar Basak-4004
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Treevill: N.B.G. Unique & Rare Raw Dataset

This dataset, named Treevill: N.B.G. Unique & Rare Raw Dataset, is a collection of images sourced from the National Botanical Garden of Bangladesh (N.B.G.), showcasing a variety of unique and rare tree species. The dataset contains a total of 66 folders, each representing a specific tree species. For each species, approximately 2000 images are included, all resized to 256x256 pixels in JPEG format.

The dataset is intended for research, educational, and machine learning purposes, particularly in the fields of image classification, object recognition, and biodiversity studies. The high number of images per tree species ensures diversity in terms of tree angles, lighting, and conditions, which can be crucial for training machine learning models for species identification.

Procedure for Data Collection and Organization:

Data Collection: Images were collected from the National Botanical Garden of Bangladesh. Each tree species was carefully documented, ensuring that images captured a variety of perspectives and conditions for each species. Image Resizing: All images were resized to a standard resolution of 256x256 pixels for consistency in the dataset. Format Standardization: All images were converted into JPEG format, ensuring uniformity and ease of use in various applications. Folder Organization: Each species was assigned a unique folder in the dataset. These folders are named after the species they represent and contain approximately 2000 images each. Final Dataset: The final dataset consists of 66 folders, each dedicated to a specific tree species, making it easier to access and analyze the data for various tree-related research purposes.

List of Folders (Tree Species):

Akashmoni Aloe Wood Ashok Ashore Australian Pine Avocado Bahera Bamboo Banana Baro bottle brush Bazna Belati gab Bishop wood Blue Bellvine Buddha Coconut Camphor Tree Cannonball Tree Carambola Champaca Chaplash Civit Corkwood Crown Gardenia Debdaeu Devil Tree Dvils Cotton East Indian copaiba balsam Egyptian lotus Golden Shower Tree Guava Hairy Sterculia Haldu Haritaki Heaven Lotus Hijol Holudkrishnachura India Red Pear Jack Fruit Jiga Kamala Tree Kanjal Karanja Karen Wood Khejur Koinar Loha kat Mahogany Makri-shal Mango Marking Nut tree Mastwood Mexican lilac Mouskanda Nageshore Palm Piliostigma Prickly Tree Raktan Roskau Sada Golachi Shail Vadi Sisso Soap Nut Tree Teak The Poonspar Tree Udaya padda

Source: National Botanical Garden, Zoo Road, Dhaka, Bangladesh

Related links:

Shuvo, Shuvo Kumar Basak (2025), “Treevill: National Botanical Garden Unique & Rare Tree Argument Dataset ”, Mendeley Data, V1, doi: 10.17632/t7rwzgbfdd.1

https://doi.org/10.34740/KAGGLE/DSV/10582625

https://doi.org/10.34740/KAGGLE/DSV/10579609

https://doi.org/10.34740/KAGGLE/DSV/10579122

Treevill: N.B.G. Unique & Rare Raw Dataset - Access, Collaboration, and Paid Services Policy

I, Shuvo Kumar Basak, have created and curated the Treevill: N.B.G. Unique & Rare Raw Dataset, which consists of images of unique and rare tree species collected from the National Botanical Garden of Bangladesh. This dataset is freely available for research, educational, and non-commercial purposes.

Free Access to the Dataset: The Treevill: N.B.G. Unique & Rare Raw Dataset is available free of charge to all individuals and organizations for educational and research use. This is to support the advancement of knowledge and studies related to biodiversity, machine learning, and related fields.

Future Collaboration and Data Requests: While the dataset is provided free of charge, I encourage individuals and organizations to contact me directly if they need access to additional related data, further assistance, or if they plan on expanding their research in the future.

If you require any new data or specific related datasets, feel free to reach out to me, Shuvo Kumar Basak, for collaboration. I am happy to assist with additional data collection, cleaning, resizing, or other related services at a reasonable cost.

Paid Services - Hire for Data Collection: If you or your organization need custom data collection or wish to obtain related datasets beyond what is included in this collection, I offer a paid service to gather new data according to your specific requirements. This includes: Custom data collection for other tree species or related botanical data.

Data cleaning, resizing, and preprocessing to make the data ready for analysis.

Please contact me for a custom quote based on your specific needs. I will work with you to provide high-quality, tailored datasets to support your research, project, or business needs. Terms and Conditions: The dataset is intended for academic, research, and non-commercial purposes only. Redistribution or commercial use of the dataset without prior written co...

Cafe Sales - Dirty Data for Cleaning Training

kaggle.com

zip

Updated Jan 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Explore at:

zip(113510 bytes)Available download formats

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective price ranges:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3

Use Cases

This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.

Cleaning Steps Suggestions

To clean this dataset, consider the following steps: 1. Handle Missing Values: - Fill missing numeric values with the median or mean. - Replace missing categorical values with the mode or "Unknown."

Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

Raw data of survival analysis
figshare.com
xlsx
Updated Aug 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Li Gao (2020). Raw data of survival analysis [Dataset]. http://doi.org/10.6084/m9.figshare.12751439.v2
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12751439.v2
Dataset updated
Aug 20, 2020
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Li Gao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw data of survival analysis
Data from: Does the Disclosure of Gun Ownership Affect Crime? Evidence from...
search.datacite.org
openicpsr.org
+1more
Updated 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel Tannenbaum (2018). Does the Disclosure of Gun Ownership Affect Crime? Evidence from New York [Dataset]. http://doi.org/10.3886/e109802v1
Explore at:
Unique identifier
https://doi.org/10.3886/e109802v1
Dataset updated
2018
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
DataCitehttps://www.datacite.org/
Authors
Daniel Tannenbaum
Description
This repository contains the data and code necessary to replicate all figures and tables in the working paper: "Does the disclosure of gun ownership affect crime? Evidence from New York" by Daniel Tannenbaum
There are four folders in this repository:(1) Build: contains all the .do files required to produce the analysis datasets, using the raw data (i.e. datasets in the RawData folder).(2) Analysis: contains all the .do files required to produce all the figures and tables in the paper, using the analysis datasets (i.e. datasets in the AnalysisData folder).(3) RawData: contains all the raw datasets used to produce the AnalysisData datasets. The only raw dataset used in the paper that is excluded from this folder is the proprietary housing assessor and sales transaction data from DataQuick, owned by Corelogic. If I receive approval to include this raw data in this repository I will do so in future versions of this repository.(4) AnalysisData: contains all the analysis datasets that are created using the Build and are used to produce the tables and figures in the paper.

Running the file Master_analysis.do in the Analysis folder will produce, in one script, all the tables and figures in the paper.
Surveys of Data Professionals (Alex the Analyst)
kaggle.com
zip
Updated Nov 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
Explore at:
zip(81050 bytes)Available download formats
Dataset updated
Nov 27, 2023
Authors
Stewie
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
[Dataset Name] - About This Dataset

Overview

This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

Context

The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

Content

The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

Acknowledgements

The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

Citation

If you use this dataset in your work, please cite it as follows:

How to Use

Download the dataset from this link.

Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

Feel free to reach out for any additional information or clarification. Happy analyzing!
Massive Bank dataset ( 1 Million+ rows)
kaggle.com
zip
Updated Feb 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
K S ABISHEK (2023). Massive Bank dataset ( 1 Million+ rows) [Dataset]. https://www.kaggle.com/datasets/ksabishek/massive-bank-dataset-1-million-rows
Explore at:
zip(32471013 bytes)Available download formats
Dataset updated
Feb 21, 2023
Authors
K S ABISHEK
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Greetings , fellow analysts !

(NOTE : This is a random dataset generated using python. It bears no resemblance to any real entity in the corporate world. Any resemblance is a matter of coincidence.)

REC-SSEC Bank is a govt-aided bank operating in the Indian Peninsula. They have regional branches in over 40+ regions of the country. You have been provided with a massive excel sheet containing the transaction details, the total transaction amount and their location and total transaction count.

The dataset is described as follows :

Date - The date on which the transaction took place. 2.Domain - Where or which type of Business entity made the transaction. 3.Location - Where the data is collected from 4.Value - Total value of transaction

Count of transaction .

For example , in the very first row , the data can be read as : " On the first of January, 2022 , 1932 transactions of summing upto INR 365554 from Bhuj were reported " NOTE : There are about 2750 transactions every single day. All of this has been given to you.

The bank wants you to answer the following questions :

What is the average transaction value everyday for each domain over the year.

What is the average transaction value for every city/location over the year

The bank CEO , Mr: Hariharan , wants to promote the ease of transaction for the highest active domain. If the domains could be sorted into a priority, what would be the priority list ?

What's the average transaction count for each city ?
RAW Data Excel and SPSS
figshare.com
xlsx
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jamil Ahmed (2024). RAW Data Excel and SPSS [Dataset]. http://doi.org/10.6084/m9.figshare.27101149.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27101149.v1
Dataset updated
Sep 25, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Jamil Ahmed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This cross-sectional study aimed to determine the prevalence of obesity and perceived barriers to weight loss in 1453 Bahraini adults who had used any intervention to lose weight in the past year. We found a high prevalence (78.2%) of overweight and obesity. Females were more likely to have obesity compared to males (81.4% vs. 66.7%). Older individuals aged 36-45 were 3.37 times, and 45 or older were 3.56 times more likely to have obesity. Married participants had higher odds of obesity compared to single participants (OR=1.79). Participants with obesity were more likely to be unemployed compared to students (OR=1.49). The most common contributing factors to weight gain were lack of physical activity (29.5%) and unhealthy diet (29.2%). Participants with obesity were more likely to have relied on dieting (OR=2.53) or exercise (OR=1.47) for weight loss and used medication (OR=5.23). This study highlights the complex relationship between sociodemographic factors, lifestyle behaviors, and obesity and sustaining weight loss.
S
SmartPLS Data
scidb.cn
Updated Feb 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pham Hoang Hien (2023). SmartPLS Data [Dataset]. http://doi.org/10.57760/sciencedb.07443
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.07443
Dataset updated
Feb 17, 2023
Dataset provided by
Science Data Bank
Authors
Pham Hoang Hien
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is used for SEM Analysis with SmartPLS
d
Coresignal | Employee Data | From the Largest Professional Network | Global...
datarade.ai
.json, .csv
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Coresignal, Coresignal | Employee Data | From the Largest Professional Network | Global / 712M+ Records / 5 Years of Historical Data / Updated Daily [Dataset]. https://datarade.ai/data-products/public-resume-data-coresignal
Explore at:
.json, .csvAvailable download formats
Dataset authored and provided by
Coresignal
Area covered
Palestine, Brunei Darussalam, Macao, Russian Federation, French Guiana, Latvia, Eritrea, Réunion, Christmas Island, Bosnia and Herzegovina
Description
➡️ You can choose from multiple data formats, delivery frequency options, and delivery methods;

➡️ You can select raw or clean and AI-enriched datasets;

➡️ Multiple APIs designed for effortless search and enrichment (accessible using a user-friendly self-service tool);

➡️ Fresh data: daily updates, easy change tracking with dedicated data fields, and a constant flow of new data;

➡️ You get all necessary resources for evaluating our data: a free consultation, a data sample, or free credits for testing our APIs.

Coresignal's employee data enables you to create and improve innovative data-driven solutions and extract actionable business insights. These datasets are popular among companies from different industries, including HR and sales technology and investment.

Employee Data use cases:

✅ Source best-fit talent for your recruitment needs

Coresignal's Employee Data can help source the best-fit talent for your recruitment needs by providing the most up-to-date information on qualified candidates globally.

✅ Fuel your lead generation pipeline

Enhance lead generation with 712M+ up-to-date employee records from the largest professional network. Our Employee Data can help you develop a qualified list of potential clients and enrich your own database.

✅ Analyze talent for investment opportunities

Employee Data can help you generate actionable signals and identify new investment opportunities earlier than competitors or perform deeper analysis of companies you're interested in.

➡️ Why 400+ data-powered businesses choose Coresignal:

Experienced data provider (in the market since 2016);

Exceptional client service;

Responsible and secure data collection.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Data from: Raw data files
figshare.com
bin
Updated Mar 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ronen Schuster (2021). Raw data files [Dataset]. http://doi.org/10.6084/m9.figshare.14319758.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14319758.v1
Dataset updated
Mar 26, 2021
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Ronen Schuster
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw data tables and the statistical analysis applied to the data. Files are labeled by figure number. Within each file, each table and linked graph and analysis is annotated by figure number and panel letter. All files are generated in graphpad prism.
f
Data Processing Has Major Impact on the Outcome of Quantitative Label-Free...
acs.figshare.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aakash Chawade; Marianne Sandin; Johan Teleman; Johan Malmström; Fredrik Levander (2023). Data Processing Has Major Impact on the Outcome of Quantitative Label-Free LC-MS Analysis [Dataset]. http://doi.org/10.1021/pr500665j.s003
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/pr500665j.s003
Dataset updated
Jun 1, 2023
Dataset provided by
ACS Publications
Authors
Aakash Chawade; Marianne Sandin; Johan Teleman; Johan Malmström; Fredrik Levander
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
High-throughput multiplexed protein quantification using mass spectrometry is steadily increasing in popularity, with the two major techniques being data-dependent acquisition (DDA) and targeted acquisition using selected reaction monitoring (SRM). However, both techniques involve extensive data processing, which can be performed by a multitude of different software solutions. Analysis of quantitative LC-MS/MS data is mainly performed in three major steps: processing of raw data, normalization, and statistical analysis. To evaluate the impact of data processing steps, we developed two new benchmark data sets, one each for DDA and SRM, with samples consisting of a long-range dilution series of synthetic peptides spiked in a total cell protein digest. The generated data were processed by eight different software workflows and three postprocessing steps. The results show that the choice of the raw data processing software and the postprocessing steps play an important role in the final outcome. Also, the linear dynamic range of the DDA data could be extended by an order of magnitude through feature alignment and a charge state merging algorithm proposed here. Furthermore, the benchmark data sets are made publicly available for further benchmarking and software developments.

Netflix Data: Cleaning, Analysis and Visualization

kaggle.com

zip

Updated Aug 26, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization

Explore at:

zip(276607 bytes)Available download formats

Dataset updated

Aug 26, 2022

Authors

Abdulrasaq Ariyo

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .

Data Cleaning

We are going to: 1. Treat the Nulls 2. Treat the duplicates 3. Populate missing rows 4. Drop unneeded columns 5. Split columns Extra steps and more explanation on the process will be explained through the code comments

--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                  
SELECT show_id, COUNT(*)                                                                                      
FROM netflix 
GROUP BY show_id                                                                                              
ORDER BY show_id DESC;

--No duplicates

--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
    COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
    COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
    COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
    COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
    COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
    COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
    COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
    COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
    COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;

We can see that there are NULLS. 
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3

The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;

With this, we can now populate NULL rows in directors 
using their record with movie_cast

UPDATE netflix 
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;

--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--When I was doing this, I found a less complex and faster way to populate a column which I will use next

Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column

--Populate the country using the director column

SELECT COALESCE(nt.country,nt2.country) 
FROM netflix AS nt
JOIN netflix AS nt2 
ON nt.director = nt2.director 
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
AND netflix.country IS NULL;


--To confirm if there are still directors linked to country that refuse to update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL in director as "Not Given"

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;

The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization

--Show date_added nulls

SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;

--DELETE nulls

DELETE F...

Facebook

Twitter

Click to copy link

Link copied

Cite

Ann Truong (2023). Scooter Sales - Excel Project [Dataset]. https://www.kaggle.com/datasets/bvanntruong/scooter-sales-excel-project

Scooter Sales - Excel Project

Salesperson data from scooter sales

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 8, 2023

Dataset provided by

Kaggle

Authors

Ann Truong

Description

The link for the Excel project to download can be found on GitHub here. It includes the raw data, Pivot Tables, and an interactive dashboard with Pivot Charts and Slicers. The project also includes business questions and the formulas I used to answer. The image below is included for ease. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12904052%2F61e460b5f6a1fa73cfaaa33aa8107bd5%2FBusinessQuestions.png?generation=1686190703261971&alt=media" alt=""> The link for the Tableau adjusted dashboard can be found here.

A screenshot of the interactive Excel dashboard is also included below for ease. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12904052%2Fe581f1fce8afc732f7823904da9e4cce%2FScooter%20Dashboard%20Image.png?generation=1686190815608343&alt=media" alt="">

Clear search

Close search

Google apps

Main menu

Scooter Sales - Excel Project

Data from: Current and projected research data storage needs of Agricultural...

HR Analytics Dataset

Supply Chain DataSet

Data Cleaning Sample

CSV file used in statistical analyses

18 excel spreadsheets by species and year giving reproduction and growth...

Treevill: N.B.G. Unique & Rare Raw Dataset

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Overview

File Information

Columns Description

Data Characteristics

Menu Items

Use Cases

Cleaning Steps Suggestions

License

Feedback

Raw data of survival analysis

Data from: Does the Disclosure of Gun Ownership Affect Crime? Evidence from...

Surveys of Data Professionals (Alex the Analyst)

[Dataset Name] - About This Dataset

Overview

Context

Content

Acknowledgements

Citation

How to Use

Massive Bank dataset ( 1 Million+ rows)

RAW Data Excel and SPSS

SmartPLS Data

Coresignal | Employee Data | From the Largest Professional Network | Global...

Datasets for Sentiment Analysis

Data from: Raw data files

Data Processing Has Major Impact on the Outcome of Quantitative Label-Free...

Netflix Data: Cleaning, Analysis and Visualization

Data Cleaning

Scooter Sales - Excel Project

Salesperson data from scooter sales