100+ datasets found
  1. Raw data outputs 1-18

    • bridges.monash.edu
    • researchdata.edu.au
    xlsx
    Updated May 30, 2023
    Cite
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie (2023). Raw data outputs 1-18 [Dataset]. http://doi.org/10.26180/21259491.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data outputs 1-18

    Raw data output 1. Differentially expressed genes in AML CSCs compared with GTCs as well as in TCGA AML cancer samples compared with normal ones. This data was generated based on the results of AML microarray and TCGA data analysis.
    Raw data output 2. Commonly and uniquely differentially expressed genes in AML CSC/GTC microarray and TCGA bulk RNA-seq datasets. This data was generated based on the results of AML microarray and TCGA data analysis.
    Raw data output 3. Common differentially expressed genes between training and test set samples of the microarray dataset. This data was generated based on the results of AML microarray data analysis.
    Raw data output 4. Detailed information on the samples of the breast cancer microarray dataset (GSE52327) used in this study.
    Raw data output 5. Differentially expressed genes in breast CSCs compared with GTCs as well as in TCGA BRCA cancer samples compared with normal ones.
    Raw data output 6. Commonly and uniquely differentially expressed genes in breast cancer CSC/GTC microarray and TCGA BRCA bulk RNA-seq datasets. This data was generated based on the results of breast cancer microarray and TCGA BRCA data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
    Raw data output 7. Differential and common co-expression and protein-protein interaction of genes between CSC and GTC samples. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
    Raw data output 8. Differentially expressed genes between AML dormant and active CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
    Raw data output 9. Uniquely expressed genes in dormant or active AML CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
    Raw data output 10. Intersections between the targeting transcription factors of AML key CSC genes and differentially expressed genes between AML CSCs vs GTCs and between dormant and active AML CSCs or the uniquely expressed genes in either class of CSCs.
    Raw data output 11. Targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 12. CSC-specific targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 13. The protein-protein interactions between AML key CSC genes with themselves and their targeting transcription factors. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis.
    Raw data output 14. The previously confirmed associations of genes having the highest targeting desirableness and CSC-specific targeting desirableness scores with AML or other cancers’ (stem) cells as well as hematopoietic stem cells. These data were generated based on a PubMed database-based literature mining.
    Raw data output 15. Drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 16. CSC-specific drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
    Raw data output 17. Candidate drugs for experimental validation. These drugs were selected based on their respective (CSC-specific) drug scores. CSC is the abbreviation of cancer stem cell.
    Raw data output 18. Detailed information on the samples of the AML microarray dataset GSE30375 used in this study.

  2. HR Analytics Dataset

    • kaggle.com
    zip
    Updated Oct 27, 2023
    Cite
    anshika2301 (2023). HR Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/anshika2301/hr-analytics-dataset
    Explore at:
    Available download formats: zip (213690 bytes)
    Dataset updated
    Oct 27, 2023
    Authors
    anshika2301
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HR analytics, also referred to as people analytics, workforce analytics, or talent analytics, involves gathering, analyzing, and reporting HR data. It is the collection and application of talent data to improve critical talent and business outcomes. It enables your organization to measure the impact of a range of HR metrics on overall business performance and make decisions based on data. HR analysts are primarily responsible for interpreting and analyzing vast datasets.

    Download the data CSV files here: https://drive.google.com/drive/folders/18mQalCEyZypeV8TJeP3SME_R6qsCS2Og

  3. Data analysis project

    • kaggle.com
    zip
    Updated Aug 15, 2024
    Cite
    SzymonBiaas (2024). Data analysis project [Dataset]. https://www.kaggle.com/datasets/szymonbiaas/data-analysis-project
    Explore at:
    Available download formats: zip (116043378 bytes)
    Dataset updated
    Aug 15, 2024
    Authors
    SzymonBiaas
    Description

    This dashboard was created from data published by Olist Store (a Brazilian e-commerce public dataset). Raw data contains information about 100 000 orders from 2016 to 2018 placed in many regions of Brazil.

    The raw datasets were imported into Excel using the “Get data” option (formerly known as “Power Query”) and cleaned. An additional table with the names of Brazilian states was also imported from the Wikipedia page.

    A Data Table about payment information was created from the imported statistics using nested formulas. Pivot charts were then used to build an Olist Store Payment Dashboard, which allows you to review the data using a connected timeline and slicers.

  4. Raw data from datasets used in SIMON analysis

    • data.europa.eu
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). Raw data from datasets used in SIMON analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2580414?locale=hr
    Explore at:
    Available download formats: unknown (312591)
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder which contains 4 files:

    • json_info: the number of features with their names, and the number of subjects that are available for the same dataset
    • data_testing: data frame with data used to test the trained model
    • data_training: data frame with data used to train models
    • results: direct unfiltered data from the database

    Files are written in feather format. Here is an example of data structure for each file in repository. File was compressed using 7-Zip available at https://www.7-zip.org/.

  5. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
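
    Because the rows are ordered by label, a shuffle step is needed before splitting the data. A minimal sketch of that step is shown here, using only the Python standard library; the toy rows stand in for the real contents of data_rt.csv, and the function name and seed are illustrative assumptions rather than part of the dataset.

    ```python
    import random


    def shuffle_rows(rows, seed=0):
        """Return a shuffled copy of rows; the input list is left untouched.

        The seed is an arbitrary choice that makes the order reproducible.
        """
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)
        return shuffled


    # Toy stand-in for data_rt.csv: negatives (label 0) first, positives (label 1) last.
    rows = [(f"neg review {i}", 0) for i in range(5)] + [
        (f"pos review {i}", 1) for i in range(5)
    ]
    shuffled = shuffle_rows(rows, seed=42)
    ```

    Shuffling a copy keeps the original ordering available, which can help when checking that no rows were lost.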

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa) scraped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv
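
    The description says the division label is derived from the polarity score but does not give the cut-offs. The sketch below shows one plausible mapping; the thresholds (sign of the score, with an optional neutral band) and the label names are assumptions for illustration, not the rule used to build EcoPreprocessed.csv.

    ```python
    def polarity_to_division(polarity, neutral_band=0.0):
        """Map a polarity score in [-1, 1] to a categorical label.

        Assumed rule: scores above +neutral_band are "positive", scores below
        -neutral_band are "negative", everything else is "neutral".
        """
        if polarity > neutral_band:
            return "positive"
        if polarity < -neutral_band:
            return "negative"
        return "neutral"


    labels = [polarity_to_division(p) for p in (0.6, 0.0, -0.25)]
    ```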

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  6. Supply Chain DataSet

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
    Explore at:
    Available download formats: zip (9340 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Amir Motefaker
    Description

    Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.

  7. Malware Analysis Datasets: Raw PE as Image

    • ieee-dataport.org
    Updated Nov 7, 2019
    + more versions
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Raw PE as Image [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-raw-pe-image
    Explore at:
    Dataset updated
    Nov 7, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: the raw PE byte stream rescaled to a 32 x 32 greyscale image using the Nearest Neighbor Interpolation algorithm and then flattened to a 1024-byte vector. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
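
    The rescaling step described above can be sketched roughly as follows. This is not the author's code: the way the byte stream is first laid out as a 2D image (a near-square grid padded with zero bytes) is an assumption, and the function name is hypothetical. Only the nearest-neighbor sampling and the 32 x 32 to 1024-byte flattening follow the description.

    ```python
    import math


    def bytes_to_image_vector(data: bytes, side: int = 32) -> bytes:
        """Rescale a byte stream to a side x side greyscale image and flatten it.

        Assumption: the stream is laid out row by row on a near-square grid,
        padded with zero bytes, before nearest-neighbor sampling.
        """
        if not data:
            return bytes(side * side)
        width = math.ceil(math.sqrt(len(data)))
        height = math.ceil(len(data) / width)
        padded = data + bytes(width * height - len(data))

        out = bytearray()
        for r in range(side):
            src_r = r * height // side  # nearest source row for target row r
            for c in range(side):
                src_c = c * width // side  # nearest source column for target column c
                out.append(padded[src_r * width + src_c])
        return bytes(out)


    # Toy input standing in for a PE file's raw bytes.
    vector = bytes_to_image_vector(bytes(range(256)) * 40)
    ```

    Each output byte is one greyscale pixel in the range 0-255, so the result can be fed to a model as a flat 1024-element feature vector.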

  8. The raw datasets and statistics for each analysis result graph.

    • plos.figshare.com
    zip
    Updated Jun 11, 2023
    Cite
    Barry Smyth (2023). The raw datasets and statistics for each analysis result graph. [Dataset]. http://doi.org/10.1371/journal.pone.0251513.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Barry Smyth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each individual result graph is associated with 4 different comma-separated files: (i) Raw—the (anonymised) raw data behind the means and standard deviations used for a particular result graph; (ii) Paired—the paired statistical significance results; (iii) Successive Male—the statistical significance results to compare successive groups (age and ability) for male runners; and (iv) Successive Female—the corresponding results for the statistical significance tests to compare successive groups (age and ability) of female runners. (ZIP)

  9. Target products dataset

    • crawlfeeds.com
    csv, zip
    Updated Sep 10, 2024
    Cite
    Crawl Feeds (2024). Target products dataset [Dataset]. https://crawlfeeds.com/datasets/target-products-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Target Products Dataset is a robust collection in CSV format, featuring 1.3 million product records sourced from Target's online platform. This dataset contains rich details on a wide range of products, including product titles, URLs, pricing, availability, and more. It is an ideal resource for businesses, researchers, and data scientists interested in analyzing retail trends, product availability, and pricing strategies.

    Key Data Fields:

    • Title: Name of the product.
    • URL: Direct link to the product page.
    • Brand: The brand associated with the product.
    • Main Image: URL of the main product image.
    • SKU: Unique Stock Keeping Unit identifier.
    • Description: A structured product description.
    • Raw Description: The original product description before any processing.
    • GTIN13: Global Trade Item Number (GTIN) in 13-digit format.
    • Currency: The currency in which the product is priced.
    • Price: Price of the product.
    • Availability: Availability status of the product (e.g., in stock, out of stock).
    • Available Delivery Method: Methods through which the product can be delivered.
    • Available Branch: Information on availability at specific store locations.
    • Primary Category: The main category to which the product belongs.
    • Sub Category 1, 2, 3: Further sub-categorization of the product.
    • Images: URLs to additional product images.
    • Raw Specifications: Unprocessed specifications of the product.
    • Specifications: Structured product specifications.
    • Highlights: Key highlights and features of the product.
    • Raw Highlights: Unstructured highlights before processing.
    • Uniq ID: A unique identifier for each product.
    • Scraped At: The timestamp indicating when the data was collected.

    Use Cases:

    • Retail Analytics: Analyze pricing trends, brand popularity, and product availability across categories.
    • Product Categorization: Study the classification of products into primary and sub-categories.
    • E-commerce Analysis: Use this dataset for consumer behavior studies, inventory management, or competitive analysis.
    • Recommendation Systems: Build product recommendation engines using product features, pricing, and availability data.

  10. Treevill: N.B.G. Unique & Rare Raw Dataset

    • kaggle.com
    zip
    Updated Jan 26, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Treevill: N.B.G. Unique & Rare Raw Dataset [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/treevill-n-b-g-unique-and-rare-raw-dataset
    Explore at:
    Available download formats: zip (2423469043 bytes)
    Dataset updated
    Jan 26, 2025
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Treevill: N.B.G. Unique & Rare Raw Dataset

    This dataset, named Treevill: N.B.G. Unique & Rare Raw Dataset, is a collection of images sourced from the National Botanical Garden of Bangladesh (N.B.G.), showcasing a variety of unique and rare tree species. The dataset contains a total of 66 folders, each representing a specific tree species. For each species, approximately 2000 images are included, all resized to 256x256 pixels in JPEG format.

    The dataset is intended for research, educational, and machine learning purposes, particularly in the fields of image classification, object recognition, and biodiversity studies. The high number of images per tree species ensures diversity in terms of tree angles, lighting, and conditions, which can be crucial for training machine learning models for species identification.

    Procedure for Data Collection and Organization:

    • Data Collection: Images were collected from the National Botanical Garden of Bangladesh. Each tree species was carefully documented, ensuring that images captured a variety of perspectives and conditions for each species.
    • Image Resizing: All images were resized to a standard resolution of 256x256 pixels for consistency in the dataset.
    • Format Standardization: All images were converted into JPEG format, ensuring uniformity and ease of use in various applications.
    • Folder Organization: Each species was assigned a unique folder in the dataset. These folders are named after the species they represent and contain approximately 2000 images each.
    • Final Dataset: The final dataset consists of 66 folders, each dedicated to a specific tree species, making it easier to access and analyze the data for various tree-related research purposes.

    List of Folders (Tree Species):

    Akashmoni Aloe Wood Ashok Ashore Australian Pine Avocado Bahera Bamboo Banana Baro bottle brush Bazna Belati gab Bishop wood Blue Bellvine Buddha Coconut Camphor Tree Cannonball Tree Carambola Champaca Chaplash Civit Corkwood Crown Gardenia Debdaeu Devil Tree Dvils Cotton East Indian copaiba balsam Egyptian lotus Golden Shower Tree Guava Hairy Sterculia Haldu Haritaki Heaven Lotus Hijol Holudkrishnachura India Red Pear Jack Fruit Jiga Kamala Tree Kanjal Karanja Karen Wood Khejur Koinar Loha kat Mahogany Makri-shal Mango Marking Nut tree Mastwood Mexican lilac Mouskanda Nageshore Palm Piliostigma Prickly Tree Raktan Roskau Sada Golachi Shail Vadi Sisso Soap Nut Tree Teak The Poonspar Tree Udaya padda
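
    Given the one-folder-per-species layout described above, a quick sanity check is to count the JPEG files in each folder after unpacking the archive. The sketch below uses only the standard library; the function name is hypothetical, and the accepted file extensions (.jpg, .jpeg) are an assumption about how the files are named.

    ```python
    from pathlib import Path


    def count_images_per_species(root):
        """Return {folder name: number of JPEG files} for each sub-folder of root."""
        counts = {}
        for folder in sorted(Path(root).iterdir()):
            if folder.is_dir():
                counts[folder.name] = sum(
                    1
                    for f in folder.iterdir()
                    if f.is_file() and f.suffix.lower() in {".jpg", ".jpeg"}
                )
        return counts
    ```

    Calling count_images_per_species on the unpacked dataset directory should, if the description holds, give 66 entries with roughly 2000 images each.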

    Source: National Botanical Garden, Zoo Road, Dhaka, Bangladesh

    Related links:

    Shuvo, Shuvo Kumar Basak (2025), “Treevill: National Botanical Garden Unique & Rare Tree Argument Dataset ”, Mendeley Data, V1, doi: 10.17632/t7rwzgbfdd.1

    https://doi.org/10.34740/KAGGLE/DSV/10582625

    https://doi.org/10.34740/KAGGLE/DSV/10579609

    https://doi.org/10.34740/KAGGLE/DSV/10579122

    Treevill: N.B.G. Unique & Rare Raw Dataset - Access, Collaboration, and Paid Services Policy

    I, Shuvo Kumar Basak, have created and curated the Treevill: N.B.G. Unique & Rare Raw Dataset, which consists of images of unique and rare tree species collected from the National Botanical Garden of Bangladesh. This dataset is freely available for research, educational, and non-commercial purposes.

    Free Access to the Dataset: The Treevill: N.B.G. Unique & Rare Raw Dataset is available free of charge to all individuals and organizations for educational and research use. This is to support the advancement of knowledge and studies related to biodiversity, machine learning, and related fields.

    Future Collaboration and Data Requests: While the dataset is provided free of charge, I encourage individuals and organizations to contact me directly if they need access to additional related data, further assistance, or if they plan on expanding their research in the future.

    If you require any new data or specific related datasets, feel free to reach out to me, Shuvo Kumar Basak, for collaboration. I am happy to assist with additional data collection, cleaning, resizing, or other related services at a reasonable cost.

    Paid Services - Hire for Data Collection: If you or your organization need custom data collection or wish to obtain related datasets beyond what is included in this collection, I offer a paid service to gather new data according to your specific requirements. This includes:

    • Custom data collection for other tree species or related botanical data.
    • Data cleaning, resizing, and preprocessing to make the data ready for analysis.

    Please contact me for a custom quote based on your specific needs. I will work with you to provide high-quality, tailored datasets to support your research, project, or business needs.

    Terms and Conditions: The dataset is intended for academic, research, and non-commercial purposes only. Redistribution or commercial use of the dataset without prior written co...

  11. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    Available download formats: zip (226740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
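
    Since Total Spent is defined as Quantity * Price Per Unit, a single missing value among the three can be recomputed from the other two. The sketch below illustrates that cleaning step under the assumption that missing values have already been parsed to Python None; the function name is hypothetical, and rows with two or more missing values are left unchanged.

    ```python
    def fill_missing_amounts(quantity, price_per_unit, total_spent):
        """Recompute one missing value from Total Spent = Quantity * Price Per Unit.

        Returns (quantity, price_per_unit, total_spent). Values stay None when
        there is not enough information, or when a division by zero would occur.
        """
        if total_spent is None and quantity is not None and price_per_unit is not None:
            total_spent = quantity * price_per_unit
        elif quantity is None and total_spent is not None and price_per_unit:
            quantity = total_spent / price_per_unit
        elif price_per_unit is None and total_spent is not None and quantity:
            price_per_unit = total_spent / quantity
        return quantity, price_per_unit, total_spent


    repaired = fill_missing_amounts(2, 4.0, None)
    ```

    Note that a recovered Quantity comes back as a float (for example 2.0) and may need to be cast to an integer afterwards.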

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...

  12. Quantitative raw data for "Large scale regional citizen surveys report"...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Feb 3, 2022
    Cite
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian (2022). Quantitative raw data for "Large scale regional citizen surveys report" (D1.4) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5958017
    Explore at:
    Dataset updated
    Feb 3, 2022
    Dataset provided by
    White Research SRL
    Authors
    Panori, Anastasia; Bakratsas, Thomas; Chapizanis, Dimitrios; Altsitsiadis, Efthymios; Hauschildt, Christian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents the quantitative raw data that was collected under the H2020 RRI2SCALE project for the D1.4 - “Large scale regional citizen surveys report”. The dataset includes the answers that were provided by almost 8,000 participants from 4 pilot European regions (Kriti, Vestland, Galicia, and Overijssel) regarding the general public's views, concerns, and moral issues about the current and future trajectories of their RTD&I ecosystem. The original survey questionnaire was created by White Research SRL and disseminated to the regions through supporting pilot partners. Data collection took place from June 2020 to September 2020 through 4 different waves, one for each region. Following a consortium vote during the kick-off meeting, responses were collected through online panels run by the survey companies used for each region in order to fill the quotas, rather than through resource-intensive methods that would have made data collection unduly expensive. For the statistical analysis of the data and the conclusions drawn from the analysis, you can access the "Large scale regional citizen surveys report" (D1.4).

  13. dataset-tsql-data-analysis

    • huggingface.co
    Updated Jan 1, 2020
    + more versions
    Cite
    David Meldrum (2020). dataset-tsql-data-analysis [Dataset]. https://huggingface.co/datasets/dmeldrum6/dataset-tsql-data-analysis
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 1, 2020
    Authors
    David Meldrum
    Description

    Dataset Card for dataset-tsql-data-analysis

    This dataset has been created with distilabel.

    Dataset Summary
    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/dataset-tsql-data-analysis/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/dataset-tsql-data-analysis.

  14. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Available download formats: xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study, presenting the tagging results for the measures described in the paper. For each subject, it includes the following columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the notation used: User Stories or Use Cases
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below)
    P. the researchers' judgement on how well the student explained the derivation process: well explained (a systematic mapping that can be easily reproduced), partially explained (a vague indication of the mapping), or not present.
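    The grade categorization in column F can be sketched in a few lines. This is only an illustration of the thresholds stated above; the function name `grade_category` and the use of `None` for an empty cell are assumptions, not part of the spreadsheet.

```python
def grade_category(points):
    """Map an exam grade (points out of 100) to the L/M/H category of column F.

    Hypothetical helper: returns None for an empty cell (subject did not
    take the first exam).
    """
    if points is None:
        return None
    if points >= 80:   # H: greater than or equal to 80
        return "H"
    if points >= 65:   # M: 65 included, 80 excluded
        return "M"
    return "L"         # L: everything below 65


# Example with made-up grades
for grade in (92, 80, 79.5, 65, 40, None):
    print(grade, "->", grade_category(grade))
```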

    Tagging scheme:
    Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
    System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent legacy system or the system under design (portal, simulator) are legitimate;
    Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
    Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The size ratio is calculated as the number of classes within the student model divided by the number of classes within the expert model. We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes. However, we also provide the size ratio for the number of relationships between student and expert model.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
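    The size ratio, correctness, and completeness definitions above can be written out as a minimal sketch. The function names, the handling of a zero denominator, and the example counts are assumptions made for illustration; they do not come from the spreadsheet.

```python
def size_ratio(student_classes, expert_classes):
    """Number of classes in the student model over the number in the expert model."""
    return student_classes / expert_classes


def correctness(al, wr, so, om):
    """AL / (AL + OM + SO + WR), as described for Sheet 4."""
    denominator = al + om + so + wr
    # Assumption: report None when no situation was tagged at all.
    return al / denominator if denominator else None


def completeness(al, wr, om):
    """(AL + WR) / (AL + WR + OM), as described for Sheet 4."""
    denominator = al + wr + om
    return (al + wr) / denominator if denominator else None


# Hypothetical tag counts for one student (not taken from the dataset)
al, wr, so, om = 6, 2, 1, 3
print(correctness(al, wr, so, om))   # 6 / 12 = 0.5
print(completeness(al, wr, om))      # 8 / 11
print(size_ratio(9, 11))             # 9 student classes vs 11 expert classes
```

    Note that, following the text, the Missing (MI) count enters neither formula.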

    For sheet 4 as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by the exam grades, converted to categorical values High, Low, and Medium.
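    For readers who want to recompute the effect size reported at the bottom of sheets 4 to 8, the following is a minimal sketch of the textbook Hedges' g: Cohen's d with a pooled standard deviation, multiplied by the common small-sample correction 1 - 3/(4(n1 + n2) - 9). The authors used the online tool cited above; whether that tool applies exactly this correction is an assumption, and the sample values below are made up.

```python
import math


def hedges_g(group_a, group_b):
    """Textbook Hedges' g for two independent samples (each needs >= 2 values)."""
    n1, n2 = len(group_a), len(group_b)
    mean1 = sum(group_a) / n1
    mean2 = sum(group_b) / n2
    # Unbiased sample variances
    var1 = sum((x - mean1) ** 2 for x in group_a) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in group_b) / (n2 - 1)
    # Pooled standard deviation
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    cohens_d = (mean1 - mean2) / pooled_sd
    # Approximate small-sample bias correction
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction


# Hypothetical completeness scores for two notations
use_cases = [0.55, 0.60, 0.70, 0.65]
user_stories = [0.70, 0.75, 0.80, 0.72]
print(hedges_g(use_cases, user_stories))
```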

  15. Data from: Does the Disclosure of Gun Ownership Affect Crime? Evidence from...

    • search.datacite.org
    • openicpsr.org
    • +1 more
    Updated 2018
    Cite
    Daniel Tannenbaum (2018). Does the Disclosure of Gun Ownership Affect Crime? Evidence from New York [Dataset]. http://doi.org/10.3886/e109802v1
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research https://www.icpsr.umich.edu/web/pages/
    DataCite https://www.datacite.org/
    Authors
    Daniel Tannenbaum
    Description

    This repository contains the data and code necessary to replicate all figures and tables in the working paper: "Does the disclosure of gun ownership affect crime? Evidence from New York" by Daniel Tannenbaum
    There are four folders in this repository:
    (1) Build: contains all the .do files required to produce the analysis datasets, using the raw data (i.e. datasets in the RawData folder).
    (2) Analysis: contains all the .do files required to produce all the figures and tables in the paper, using the analysis datasets (i.e. datasets in the AnalysisData folder).
    (3) RawData: contains all the raw datasets used to produce the AnalysisData datasets. The only raw dataset used in the paper that is excluded from this folder is the proprietary housing assessor and sales transaction data from DataQuick, owned by Corelogic. If I receive approval to include this raw data in this repository I will do so in future versions of this repository.
    (4) AnalysisData: contains all the analysis datasets that are created using the Build and are used to produce the tables and figures in the paper.

    Running the file Master_analysis.do in the Analysis folder will produce, in one script, all the tables and figures in the paper.

  16. Global Burden of Disease analysis dataset of noncommunicable disease...

    • data.mendeley.com
    Updated Apr 6, 2023
    + more versions
    Cite
    David Cundiff (2023). Global Burden of Disease analysis dataset of noncommunicable disease outcomes, risk factors, and SAS codes [Dataset]. http://doi.org/10.17632/g6b39zxck4.10
    Dataset updated
    Apr 6, 2023
    Authors
    David Cundiff
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This formatted dataset (AnalysisDatabaseGBD) originates from raw data files from the Institute of Health Metrics and Evaluation (IHME) Global Burden of Disease Study (GBD2017) affiliated with the University of Washington. We are volunteer collaborators with IHME and not employed by IHME or the University of Washington.

    The population weighted GBD2017 data are on male and female cohorts ages 15-69 years including noncommunicable diseases (NCDs), body mass index (BMI), cardiovascular disease (CVD), and other health outcomes and associated dietary, metabolic, and other risk factors. The purpose of creating this population-weighted, formatted database is to explore the univariate and multiple regression correlations of health outcomes with risk factors. Our research hypothesis is that we can successfully model NCDs, BMI, CVD, and other health outcomes with their attributable risks.

    These Global Burden of Disease data relate to the preprint: The EAT-Lancet Commission Planetary Health Diet compared with Institute of Health Metrics and Evaluation Global Burden of Disease Ecological Data Analysis. The data include the following:
    1. Analysis database of population weighted GBD2017 data that includes over 40 health risk factors, noncommunicable disease deaths/100k/year of male and female cohorts ages 15-69 years from 195 countries (the primary outcome variable that includes over 100 types of noncommunicable diseases) and over 20 individual noncommunicable diseases (e.g., ischemic heart disease, colon cancer, etc.).
    2. A text file to import the analysis database into SAS
    3. The SAS code to format the analysis database to be used for analytics
    4. SAS code for deriving Tables 1, 2, 3 and Supplementary Tables 5 and 6
    5. SAS code for deriving the multiple regression formula in Table 4
    6. SAS code for deriving the multiple regression formula in Table 5
    7. SAS code for deriving the multiple regression formula in Supplementary Table 7
    8. SAS code for deriving the multiple regression formula in Supplementary Table 8
    9. The Excel files that accompanied the above SAS code to produce the tables

    For questions, please email davidkcundiff@gmail.com. Thanks.

  17. PyTorch geometric datasets for morphVQ models

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Sep 29, 2022
    Cite
    Oshane Thomas; Hongyu Shen; Ryan L. Rauum; William E. H. Harcourt-Smith; John D. Polk; Mark Hasegawa-Johnson (2022). PyTorch geometric datasets for morphVQ models [Dataset]. http://doi.org/10.5061/dryad.bvq83bkcr
    Available download formats: zip
    Dataset updated
    Sep 29, 2022
    Dataset provided by
    Dryad
    Authors
    Oshane Thomas; Hongyu Shen; Ryan L. Rauum; William E. H. Harcourt-Smith; John D. Polk; Mark Hasegawa-Johnson
    Time period covered
    Sep 2, 2022
    Description

    These datasets are customized Torch Geometric Datasets that contain raw .off polygon meshes as well as preprocessed .pt files needed for training morphVQ models. morphVQ can be found at https://github.com/oothomas/morphVQ.
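    As a rough illustration of what a raw .off polygon mesh contains, the sketch below parses the plain ASCII OFF layout: an "OFF" header, a counts line (vertices, faces, edges), one "x y z" line per vertex, and one "k i1 ... ik" line per face. It assumes the meshes use this simple form with no per-face colour values; the function name `parse_off` and the sample triangle are hypothetical, and this is not the loader used by morphVQ.

```python
def parse_off(text):
    """Parse simple ASCII OFF text into (vertices, faces)."""
    # Strip '#' comments and blank lines, then flatten into tokens so that
    # "OFF" may share a line with the counts or stand on its own line.
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            tokens.extend(line.split())
    if not tokens or tokens[0] != "OFF":
        raise ValueError("not an OFF file")

    n_vertices, n_faces = int(tokens[1]), int(tokens[2])  # tokens[3] is the edge count
    pos = 4

    vertices = []
    for _ in range(n_vertices):
        vertices.append(tuple(float(t) for t in tokens[pos:pos + 3]))
        pos += 3

    faces = []
    for _ in range(n_faces):
        k = int(tokens[pos])  # number of vertex indices in this face
        faces.append(tuple(int(t) for t in tokens[pos + 1:pos + 1 + k]))
        pos += 1 + k

    return vertices, faces


# A single made-up triangle
sample = "OFF\n3 1 0\n0 0 0\n1 0 0\n0 1 0\n3 0 1 2\n"
vertices, faces = parse_off(sample)
print(len(vertices), "vertices;", len(faces), "face(s):", faces)
```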

  18. Raw dataset for 90 tree individuals used for the statistical analysis.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 7, 2023
    Cite
    Donoso, David A.; Limberger, Oliver; Tiede, Yvonne; Bendix, Jörg; Schön, Jana E.; Homeier, Jürgen; Farwig, Nina; Becker, Marcel; Brandl, Roland (2023). Raw dataset for 90 tree individuals used for the statistical analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000947352
    Dataset updated
    Nov 7, 2023
    Authors
    Donoso, David A.; Limberger, Oliver; Tiede, Yvonne; Bendix, Jörg; Schön, Jana E.; Homeier, Jürgen; Farwig, Nina; Becker, Marcel; Brandl, Roland
    Description

    Raw dataset for 90 tree individuals used for the statistical analysis.

  19. Analysis of ITT-LOCF, LOCF, and completers for handling missing data in the...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mai A. Elobeid; Miguel A. Padilla; Theresa McVie; Olivia Thomas; David W. Brock; Bret Musser; Kaifeng Lu; Christopher S. Coffey; Renee A. Desmond; Marie-Pierre St-Onge; Kishore M. Gadde; Steven B. Heymsfield; David B. Allison (2023). Analysis of ITT-LOCF, LOCF, and completers for handling missing data in the 12 raw datasets using Multiple Imputation. [Dataset]. http://doi.org/10.1371/journal.pone.0006624.t004
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Mai A. Elobeid; Miguel A. Padilla; Theresa McVie; Olivia Thomas; David W. Brock; Bret Musser; Kaifeng Lu; Christopher S. Coffey; Renee A. Desmond; Marie-Pierre St-Onge; Kishore M. Gadde; Steven B. Heymsfield; David B. Allison
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abbreviations: RCT, randomized controlled trial; MI, multiple imputation. Each permutation test is based on 10,000 permutations of each dataset.

  20. Analysis of ITT-LOCF, LOCF, and completers for handling missing data in the...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Mai A. Elobeid; Miguel A. Padilla; Theresa McVie; Olivia Thomas; David W. Brock; Bret Musser; Kaifeng Lu; Christopher S. Coffey; Renee A. Desmond; Marie-Pierre St-Onge; Kishore M. Gadde; Steven B. Heymsfield; David B. Allison (2023). Analysis of ITT-LOCF, LOCF, and completers for handling missing data in the 12 raw datasets using ordinary least squares. [Dataset]. http://doi.org/10.1371/journal.pone.0006624.t003
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mai A. Elobeid; Miguel A. Padilla; Theresa McVie; Olivia Thomas; David W. Brock; Bret Musser; Kaifeng Lu; Christopher S. Coffey; Renee A. Desmond; Marie-Pierre St-Onge; Kishore M. Gadde; Steven B. Heymsfield; David B. Allison
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abbreviations: ITT-LOCF, intent-to-treat-last observation carried forward; RCT, randomized controlled trial. a: Indicates missing data pattern is the same for ITT-LOCF and LOCF. Each permutation test is based on 10,000 permutations of each dataset.

Cite
Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie (2023). Raw data outputs 1-18 [Dataset]. http://doi.org/10.26180/21259491.v1

Raw data outputs 1-18

Available download formats: xlsx
Dataset updated
May 30, 2023
Dataset provided by
Monash University
Authors
Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Raw data outputs 1-18
Raw data output 1. Differentially expressed genes in AML CSCs compared with GTCs as well as in TCGA AML cancer samples compared with normal ones. This data was generated based on the results of AML microarray and TCGA data analysis.
Raw data output 2. Commonly and uniquely differentially expressed genes in AML CSC/GTC microarray and TCGA bulk RNA-seq datasets. This data was generated based on the results of AML microarray and TCGA data analysis.
Raw data output 3. Common differentially expressed genes between training and test set samples of the microarray dataset. This data was generated based on the results of AML microarray data analysis.
Raw data output 4. Detailed information on the samples of the breast cancer microarray dataset (GSE52327) used in this study.
Raw data output 5. Differentially expressed genes in breast CSCs compared with GTCs as well as in TCGA BRCA cancer samples compared with normal ones.
Raw data output 6. Commonly and uniquely differentially expressed genes in breast cancer CSC/GTC microarray and TCGA BRCA bulk RNA-seq datasets. This data was generated based on the results of breast cancer microarray and TCGA BRCA data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
Raw data output 7. Differential and common co-expression and protein-protein interaction of genes between CSC and GTC samples. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.
Raw data output 8. Differentially expressed genes between AML dormant and active CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
Raw data output 9. Uniquely expressed genes in dormant or active AML CSCs. This data was generated based on the results of AML scRNA-seq data analysis.
Raw data output 10. Intersections between the targeting transcription factors of AML key CSC genes and differentially expressed genes between AML CSCs vs GTCs and between dormant and active AML CSCs or the uniquely expressed genes in either class of CSCs.
Raw data output 11. Targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
Raw data output 12. CSC-specific targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
Raw data output 13. The protein-protein interactions between AML key CSC genes with themselves and their targeting transcription factors. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis.
Raw data output 14. The previously confirmed associations of genes having the highest targeting desirableness and CSC-specific targeting desirableness scores with AML or other cancers' (stem) cells as well as hematopoietic stem cells. These data were generated based on a PubMed database-based literature mining.
Raw data output 15. Drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
Raw data output 16. CSC-specific drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.
Raw data output 17. Candidate drugs for experimental validation. These drugs were selected based on their respective (CSC-specific) drug scores. CSC is the abbreviation of cancer stem cell.
Raw data output 18. Detailed information on the samples of the AML microarray dataset GSE30375 used in this study.
