36 datasets found
  1. Online Sales Dataset - Popular Marketplace Data

    • kaggle.com
    Updated May 25, 2024
    Cite
    ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyanshVerma27
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

    Columns:

    • Order ID: Unique identifier for each sales order.
    • Date: Date of the sales transaction.
    • Category: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
    • Product Name: Specific name or model of the product sold.
    • Quantity: Number of units of the product sold in the transaction.
    • Unit Price: Price of one unit of the product.
    • Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).
    • Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
    • Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).

    Insights:

    1. Analyze sales trends over time to identify seasonal patterns or growth opportunities (a pandas sketch for insights 1 and 4 follows this list).
    2. Explore the popularity of different product categories across regions.
    3. Investigate the impact of payment methods on sales volume or revenue.
    4. Identify top-selling products within each category to optimize inventory and marketing strategies.
    5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
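
    The insights above translate into a few lines of pandas. The sketch below is illustrative only: it assumes the CSV has been downloaded locally under the hypothetical name online_sales.csv and uses the column names listed above.

      # Minimal pandas sketch for insights 1 and 4 (assumed local file name).
      import pandas as pd

      df = pd.read_csv("online_sales.csv", parse_dates=["Date"])  # hypothetical path

      # Insight 1: monthly revenue trend
      monthly_revenue = df.set_index("Date")["Total Price"].resample("MS").sum()
      print(monthly_revenue)

      # Insight 4: top three products by units sold within each category
      top_products = (
          df.groupby(["Category", "Product Name"])["Quantity"].sum()
            .groupby(level="Category", group_keys=False)
            .nlargest(3)
      )
      print(top_products)
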
  2. Residential School Locations Dataset (CSV Format)

    • borealisdata.ca
    • search.dataone.org
    Updated Jun 5, 2019
    Cite
    Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Borealis
    Authors
    Rosa Orlandini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Area covered
    Canada
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of residential schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset does not include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement.

    The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR, or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it is not considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided; when the precise location of the building is not known, an approximate location is provided.

    For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.

  3. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computational resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create TSV files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these TSVs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these TSV files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system, run tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system: tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

    Generating datasets, building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system: tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Then run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
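
    The description notes that non-R users may prefer the .tab versions of the intermediate datasets. A minimal pandas sketch, assuming a tab-separated file such as newcomers.tab (a hypothetical name mirroring the newcomers.RDS example above) has been extracted from intermediate_data.7z:

      # Load one of the tab-separated intermediate files with pandas instead of
      # R's readRDS. The file name and path are assumptions for illustration.
      import pandas as pd

      newcomers = pd.read_csv("intermediate_data/newcomers.tab", sep="\t")
      print(newcomers.shape)
      print(newcomers.head())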

  4. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Aug 12, 2025
    Cite
    Henrique Nunes; Tushar Sharma; Eduardo Figueiredo (2025). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Henrique Nunes; Tushar Sharma; Eduardo Figueiredo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contacts:

    website: https://labsoft-ufmg.github.io/

    email: henrique.mg.bh@gmail.com

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json
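
    A minimal Python sketch for loading this file; the structure of the JSON entries is not documented in this description, so the snippet only inspects the top level:

      # Load dataset/MaRV.json and peek at its structure (field names are not
      # documented here, so nothing beyond the top level is assumed).
      import json

      with open("dataset/MaRV.json", encoding="utf-8") as fh:
          marv = json.load(fh)

      if isinstance(marv, list):
          print(len(marv), "entries; first entry keys:", list(marv[0].keys()))
      elif isinstance(marv, dict):
          print("top-level keys:", list(marv.keys()))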

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables:
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
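
    Before running the scripts below, it can help to confirm these four variables are visible to your session. A small hypothetical check (the project's own scripts read them from the .env file):

      # Hypothetical sanity check for the variables described above.
      import os
      import sys

      required = ["CSV_PATH", "CLONE_DIR", "JAVA_PATH", "REFACTORING_MINER_PATH"]
      missing = [name for name in required if not os.environ.get(name)]
      if missing:
          sys.exit("Missing environment variables: " + ", ".join(missing))
      print("Environment looks complete.")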

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  5. ChesWILD attributes FY23.csv

    • hub.arcgis.com
    Updated Sep 24, 2024
    Cite
    U.S. Fish & Wildlife Service (2024). ChesWILD attributes FY23.csv [Dataset]. https://hub.arcgis.com/datasets/fws::chesapeake-wild-program-grant-projects-2023?layer=1
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    U.S. Fish and Wildlife Service (http://www.fws.gov/)
    Authors
    U.S. Fish & Wildlife Service
    Description

    Signed into law in October 2020, the Chesapeake Watershed Investments for Landscape Defense (Chesapeake WILD) Act directed the Secretary of the Interior, through the U.S. Fish and Wildlife Service, to develop and implement the non-regulatory Chesapeake WILD Program. Chesapeake WILD was established for the following purposes:

    1. Coordination among federal, state, local, and regional entities to establish a shared vision for sustaining natural resources and human communities throughout the Chesapeake Bay and its watershed;
    2. Engagement of diverse agencies and organizations to build capacity and generate funding that address shared restoration and conservation priorities; and
    3. Collaboration to administer a grant program and implement projects to conserve, steward, and enhance fish and wildlife habitats and related conservation values.

    This dataset represents the 25 projects awarded more than $7.4 million, which generated $12 million in match from the grantees, for a total conservation impact of $19.4 million supporting unmet place-based needs that align with five shared pillars.

    Spatial data are collected from grantees in a polygon geometry. The accuracy of the submitted polygons depends on the methodology grantees used to create them: some grantees have a GIS team that uploads detailed project footprints developed by experienced GIS analysts, while others create project footprints during the grant application process using a map-based online drawing tool. All polygons are reviewed and approved by relevant program staff to ensure proper scale.

    Points are generated from the polygons submitted by grantees. If there is one polygon, the point is placed at its centroid. If there are multiple polygons in close proximity to each other, one point is generated as a representative point for them. If there are multiple polygons far away from each other, multiple points are created (the centroid of each polygon). Consequently, polygons are recommended for quantitative analyses requiring high spatial resolution, and points are recommended for mapping the general location of many projects over large areas; this also means there are some duplicate points on the map.

    Please take the accuracy, methodology, and guidelines into consideration as you conduct your analyses. These data are intended for your internal analysis and the use case outlined above only. Spatial data should not be shared with third-party entities without approval from the National Fish and Wildlife Foundation. If you have any questions about the data, please contact Mike Lagua (Michael.Lagua@nfwf.org).

  6. Mathematical Problems Dataset: Various

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). Mathematical Problems Dataset: Various [Dataset]. https://www.kaggle.com/datasets/thedevastator/mathematical-problems-dataset-various-mathematic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Mathematical Problems Dataset: Various Mathematical Problems and Solutions

    Mathematical Problems Dataset: Questions and Answers

    By math_dataset (From Huggingface) [source]

    About this dataset

    This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.

    How to use the dataset

    • Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.

    • Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:

      • question: This column contains the text representation of the mathematical problem or equation.
      • answer: This column contains the text representation of the solution or answer to the corresponding problem.
    • Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for "prime" in the question column.

    • Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.

    • Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.

    • Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.

    • Sharing insights and findings: After working with this dataset, researchers and educators are encouraged to share their insights and the approaches taken during analysis and modelling as Kaggle notebooks, discussions, blog posts, or tutorials, so that others can benefit from these shared resources too.

    Note: Please note that the dataset does not include dates.

    By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
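
    For example, the filtering step described above takes one line in pandas. A minimal sketch, assuming one of the listed files (numbers_is_prime_train.csv) has been downloaded locally and contains the question and answer columns:

      # Filter problems that mention "prime" and inspect question/answer pairs.
      import pandas as pd

      df = pd.read_csv("numbers_is_prime_train.csv")  # assumed local copy
      prime_problems = df[df["question"].str.contains("prime", case=False, na=False)]
      print(len(prime_problems), "problems mention 'prime'")
      print(prime_problems[["question", "answer"]].head())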

    Research Ideas

    • Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
    • Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
    • Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...

  7. Additional file 1: of KinMap: a web-based tool for interactive navigation...

    • springernature.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Sameh Eid; Samo Turk; Andrea Volkamer; Friedrich Rippmann; Simone Fulle (2023). Additional file 1: of KinMap: a web-based tool for interactive navigation through human kinome data [Dataset]. http://doi.org/10.6084/m9.figshare.c.3658955_D1.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sameh Eid; Samo Turk; Andrea Volkamer; Friedrich Rippmann; Simone Fulle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (KinMap_Examples.zip) contains the input CSV files used to generate the annotated kinome trees in Fig. 1 (Example_1_Erlotinib_NSCLC.csv), Fig. 2a (Example_2_Sunitinib_Sorafenib_Cancer.csv), and Fig. 2b (Example_3_Kinase_Stats.csv). (ZIP 5 kb)

  8. Pulsar Voices

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; CHRISTOPHER RUSSELL; Sarath Tomy; Michael Walker (2023). Pulsar Voices [Dataset]. http://doi.org/10.6084/m9.figshare.3084748.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; CHRISTOPHER RUSSELL; Sarath Tomy; Michael Walker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data is sourced from CSIRO Parkes ATNF, e.g. http://www.atnf.csiro.au/research/pulsar/psrcat/

    Feel the pulse of the universe. We're taking signal data from astronomical "pulsar" sources and creating a way to listen to their signals audibly. Pulsar data is available from ATNF at CSIRO.au. Our team at #SciHackMelb has been working on a #datavis to give researchers and others a novel way to explore the pulsar corpus, especially through the sound of the frequencies at which the pulsars emit pulses.

    Link to the project page at #SciHackMelb: http://www.the-hackfest.com/events/melbourne-science-hackfest/projects/pulsar-voices/

    The files attached here include: source data, the project presentation, the data as used in the website (final_pulsar.sql), and other methodology documentation. Importantly, see the GitHub link, which contains the data manipulation code, the HTML code to present the data and render it audibly, and an iPython Notebook to process single-pulsar data into an audible waveform file. Together, all these resources are the Pulsar Voices activity and resulting data.

    Source data:
    • RA - east/west coordinates (0 - 24 hrs, roughly equates to longitude) [theta transforms RA to 0 - 360]
    • Dec - north/south coordinates (-90 to +90, roughly equates to latitude, i.e. 90 is above the north pole and -90 the south pole)
    • P0 - the time in seconds that a pulsar repeats its signal
    • f - 1/P0, which ranges from 700 cycles per second down to pulses that occur every few seconds
    • kps - distance from Earth in kilo-parsecs; 1 kps = 3,000 light years. The furthest data is 30 kps. The galactic centre is about 25,000 light years away, i.e. about 8 kps.

    Files:
    • psrcatShort.csv - 2,295 pulsars (all known pulsars) with the above fields RA, Dec, Theta
    • psrcatMedium.csv - adds P0 and kps; only 1,428 lines, i.e. not available for all 2,295 data points
    • psrcatSparse.csv - adds P0 and kps, blanks if n/a; 2,295 lines
    • short.txt - important pulsars with high levels of observation (** = even more closely examined)
    • pulsar.R - code contributed by Ben Raymond to visualise pulsar frequency and period in a histogram
    • pulsarVoices_authors.JPG - photo of the authors from SciHackMelb

    Added to the raw data:
    • Coordinates to map RA and Dec to screen width (y) / height (x): y = RA[Theta]*width/360; x = (Dec + 90)*height/180
    • Audible frequency converted from the pulsar frequency (1/P0). Formula for 1/P0 (x) -> Hz (y): y = 10 ^ (0.5 log(x) + 2.8). Explanation in the text file Convert1/P0toHz.txt. Tone generator from http://www.softsynth.com/webaudio/tone.php
    • A detailed audible waveform file converted from pulsar signal data, and a waveform image (the Python notebook to generate these is available)

    The project source is hosted on GitHub at https://github.com/gazzar/pulsarvoices. An IPython/Jupyter notebook contains code and a rough description of the method used to process a psrfits .sf file downloaded via the CSIRO Data Access Portal at http://doi.org/10.4225/08/55940087706E1. The notebook contains experimental code to read one of these .sf files and access the contained spectrogram data, processing it to generate an audible signal. It also reads the .txt files containing columnar pulse phase data (which is also contained in the .sf files) and processes these by frequency modulating the signal with an audible carrier. This is the method used to generate the .wav and .png files used in the web interface: https://github.com/gazzar/pulsarvoices/blob/master/ipynb/hackfest1.ipynb. A standalone Python script that does the .txt to .png and .wav signal processing was used to process 15 more pulsar data examples; these can be reproduced by running the script at https://github.com/gazzar/pulsarvoices/blob/master/data/pulsarvoices.py. Processed files are at https://github.com/gazzar/pulsarvoices/tree/master/web, e.g. https://github.com/gazzar/pulsarvoices/blob/master/web/J0437-4715.png (J0437-4715.wav | J0437-4715.png).

    The #datavis is online at http://checkonline.com.au/tooltip.php, with code at the GitHub link above. See especially https://github.com/gazzar/pulsarvoices/blob/master/web/index.php, particularly lines 314 - 328 (or search: "SELECT * FROM final_pulsar"), which load the pulsar data from the DB and push it to the screen with Hz on mouseover.

    Pulsar Voices webpage functions:
    1. There is sound when you run the mouse across the pulsars. We plot all known pulsars (N=2,295) and play a tone for pulsars with frequency data, i.e. about 75%.
    2. In the bottom-left corner, a more detailed pulsar sound and wave image pop up when you click the star icon. Two of the team worked exclusively on turning a single pulsar's waveform into an audible wav file. They created 16 of these files and a workflow, but the team only had time to load one waveform. With more time, it would be great to load these files.
    3. If you leave the mouse over a pulsar, a little data description pops up, with location (RA, Dec), distance (kilo-parsecs; 1 = 3,000 light years), and frequency of rotation (and Hz converted to human hearing).
    4. If you click on a pulsar, other pulsars with similar frequency are highlighted in white. With more time, I was interested to see whether there are harmonics between pulsars, i.e. related frequencies.

    The team:
    • Michael Walker: orcid.org/0000-0003-3086-6094; Biosciences PhD student, Unimelb, Melbourne
    • Richard Ferrers: orcid.org/0000-0002-2923-9889; ANDS Research Data Analyst, Innovation/Value Researcher, Melbourne
    • Sarath Tomy: http://orcid.org/0000-0003-4301-0690; La Trobe PhD Comp Sci, Melbourne
    • Gary Ruben: http://orcid.org/0000-0002-6591-1820; CSIRO Postdoc at Australian Synchrotron, Melbourne
    • Christopher Russell: Data Manager, CSIRO, Sydney; https://wiki.csiro.au/display/ASC/Chris+Russell
    • Anderson Murray: orcid.org/0000-0001-6986-9140; Physics Honours, Monash, Melbourne
    Contact richard.ferrers@ands.org.au for more information.

    What is still left to do?
    • Load the data, description, and images fileset to figshare :: DOI; DONE except DOI
    • Add overview images as an option, e.g. frequency bi-modal histogram
    • Colour code pulsars by distance; DONE
    • Add pulsar detail sound to the top three observants; 16 pulsars processed but not loaded
    • Add tones to pulsars to indicate f; DONE
    • Add tooltips to show location, distance, frequency, name; DONE
    • Add title and description; DONE
    • Project data onto a planetarium dome with interaction to play pulsar frequencies; DONE, see the YouTube video at https://youtu.be/F119gqOKJ1U
    • Zoom into parts of the sky to get separation between close data points - see YouTube; function in Google Earth #datavis of the dataset (link at YouTube)
    • Set upper and lower tone boundaries, so tones aren't annoying
    • Colour code pulsars by frequency bins, e.g. >100 Hz, 10 - 100, 1 - 10,
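
    The two mappings described above (screen coordinates and the 1/P0-to-audible-Hz conversion) translate directly into code. A minimal Python sketch of those formulas, assuming the log in the published formula is base 10 as explained in Convert1/P0toHz.txt:

      # Sketch of the screen mapping and tone-frequency formulas quoted above.
      import math

      def screen_xy(ra_theta_deg, dec_deg, width=1000, height=500):
          """Map RA (theta, 0-360 deg) and Dec (-90..+90 deg) to screen coordinates."""
          y = ra_theta_deg * width / 360
          x = (dec_deg + 90) * height / 180
          return x, y

      def audible_hz(pulse_freq_hz):
          """Convert pulsar rotation frequency (1/P0) to an audible tone frequency."""
          return 10 ** (0.5 * math.log10(pulse_freq_hz) + 2.8)

      # Example: a pulsar with P0 = 0.0057 s (about 175 rotations per second)
      print(screen_xy(123.4, -47.2))
      print(round(audible_hz(1 / 0.0057), 1))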

  9. Data from: Fair Notification Optimization: An Auction Approach

    • socialmediaarchive.org
    csv, txt
    Updated Sep 15, 2023
    Cite
    (2023). Fair Notification Optimization: An Auction Approach [Dataset]. https://socialmediaarchive.org/record/55
    Explore at:
    Available download formats: csv (5520707), csv (7527757), txt (4189), csv (51401577)
    Dataset updated
    Sep 15, 2023
    Description

    Notifications are important for the user experience in mobile apps and can influence their engagement. However, too many notifications can be disruptive for users. In this work, we study a novel centralized approach for notification optimization, where we view the opportunities to send user notifications as items and types of notifications as buyers in an auction market.

    The full dataset, instagram_notification_auction_base_dataset.csv, contains all generated notifications for a subset of Instagram users across four notification types within a certain time window. Each entry of the dataset represents one generated notification. For each generated notification, we include some information related to the notification as well as information related to the auctions performed to determine if the generated notification can be sent to users. See the README file for detailed column descriptions. The dataset was collected during an A/B test where we compare the performance of the first-price auction system with that of the second-price auction system.

    The two derived datasets can be useful to study fair online allocation and Fisher market equilibrium. See the README for details and a link to the scripts that generate the derived datasets.

  10. Raw data, RCWA code, COMSOL model for paper "Chiral resonant modes induced...

    • figshare.com
    csv
    Updated Jan 12, 2025
    Cite
    Bo Wang (2025). Raw data, RCWA code, COMSOL model for paper "Chiral resonant modes induced by intrinsic birefringence in lithium niobate metasurfaces" [Dataset]. http://doi.org/10.6084/m9.figshare.28189436.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Bo Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CSV files contain the raw data used to generate the figures in the paper. LN_chiral_T.py is the Python code for the RCWA solver; all the transmission spectra in the paper are calculated with LN_chiral_T.py. LN_chiral_modeanalysis.mph is a COMSOL Multiphysics model; the resonant frequencies and damping rates, electric field distributions, and optical chiral densities are simulated using LN_chiral_modeanalysis.mph.

    For Figure 1: Fig1b.csv is the raw data of Figure 1(b), calculated with LN_chiral_T.py. Fig1c_sim.csv and Fig1d_sim.csv are the raw data of the triangle-marker curves in Figures 1(c) and 1(d), generated from LN_chiral_modeanalysis.mph. Fig1c_anal.csv and Fig1d_anal.csv are the raw data of the solid curves in Figures 1(c) and 1(d), calculated using Eq. (3) of the main text. Figure 1(e) is generated from the COMSOL model LN_chiral_modeanalysis.mph.

    For Figure 2: Fig2a.csv is the raw data of Figure 2(a), calculated using Eq. (3) of the main text. Fig2b.csv is the raw data of Figure 2(b), generated from LN_chiral_modeanalysis.mph. Figure 2(c) is generated from the COMSOL model LN_chiral_modeanalysis.mph.

    For Figure 3: Fig3a.csv and Fig3b.csv are the raw data of Figures 3(a) and 3(b), calculated with the RCWA code LN_chiral_T.py. Fig3c.csv and Fig3d.csv are the raw data of Figures 3(c) and 3(d), generated from Figures 3(a) and 3(b). Figure 3(e) is produced by LN_chiral_modeanalysis.mph. Fig3f.csv is the raw data of Figure 3(f), generated from LN_chiral_modeanalysis.mph.

    For Figure 4: Fig4b.csv and Fig4d.csv are raw experimental data of Figures 4(b) and 4(d). Fig4c_blue.csv and Fig4c_red.csv are the raw data of Figure 4(c), calculated from Figures 4(b) and 4(d).

    To run LN_chiral_modeanalysis.mph, COMSOL Multiphysics version 5.6 or higher must be installed. To run LN_chiral_T.py, the RCWA solver S4 (https://web.stanford.edu/group/fan/S4/) must be installed.
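
    As a quick way to inspect any of these raw-data files, a hedged pandas/matplotlib sketch; the column layout of the CSVs is not documented here, so the snippet simply plots the first column against the second:

      # Hypothetical quick look at Fig1b.csv. Drop header=None if the file
      # turns out to have a header row; no column meaning is assumed.
      import pandas as pd
      import matplotlib.pyplot as plt

      data = pd.read_csv("Fig1b.csv", header=None)
      plt.plot(data.iloc[:, 0], data.iloc[:, 1])
      plt.xlabel("column 0")
      plt.ylabel("column 1")
      plt.title("Fig1b.csv raw data")
      plt.show()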

  11. Airline Dataset

    • kaggle.com
    Updated Sep 26, 2023
    Cite
    Sourav Banerjee (2023). Airline Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.

    Content

    This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.

    Dataset Glossary (Column-wise)

    • Passenger ID - Unique identifier for each passenger
    • First Name - First name of the passenger
    • Last Name - Last name of the passenger
    • Gender - Gender of the passenger
    • Age - Age of the passenger
    • Nationality - Nationality of the passenger
    • Airport Name - Name of the airport where the passenger boarded
    • Airport Country Code - Country code of the airport's location
    • Country Name - Name of the country the airport is located in
    • Airport Continent - Continent where the airport is situated
    • Continents - Continents involved in the flight route
    • Departure Date - Date when the flight departed
    • Arrival Airport - Destination airport of the flight
    • Pilot Name - Name of the pilot operating the flight
    • Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)
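
    The glossary above maps onto a short exploratory script. A minimal pandas sketch, assuming the Kaggle CSV is saved locally under the hypothetical name airline_dataset.csv with the columns listed above:

      # Summarise flight statuses and passenger demographics (assumed file name).
      import pandas as pd

      df = pd.read_csv("airline_dataset.csv", parse_dates=["Departure Date"])

      # Share of flights by status (e.g. on-time, delayed, cancelled)
      print(df["Flight Status"].value_counts(normalize=True))

      # Average passenger age by airport continent
      print(df.groupby("Airport Continent")["Age"].mean().round(1))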

    Structure of the Dataset

    Dataset structure preview: https://i.imgur.com/cUFuMeU.png

    Acknowledgement

    The dataset provided here is a simulated example and was generated using the online platform Mockaroo. This web-based tool enables the creation of customizable synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.

    Cover Photo by: Kevin Woblick on Unsplash

    Thumbnail by: Airplane icons created by Freepik - Flaticon

  12. Synthetic Retail Transactions (Worldwide) — Zero-PII, ML-Ready CSV/Parquet...

    • datarade.ai
    Updated Sep 4, 2025
    Cite
    Zalingo AI (2025). Synthetic Retail Transactions (Worldwide) — Zero-PII, ML-Ready CSV/Parquet for AOV, Demand & CLV POCs [Dataset]. https://datarade.ai/data-products/synthetic-retail-transactions-worldwide-zero-pii-ml-read-zalingo-ai
    Explore at:
    Available download formats: .json, .csv, .parquet, .pdf
    Dataset updated
    Sep 4, 2025
    Dataset authored and provided by
    Zalingo AI
    Area covered
    Saint Pierre and Miquelon, Aruba, Antarctica, Benin, British Indian Ocean Territory, Lithuania, Thailand, Togo, Gabon, Bangladesh
    Description

    What this is A privacy-safe synthetic retail transactions dataset with realistic baskets and prices, zero PII, and production-style schema so you can unblock AOV/CLV/demand/promo POCs immediately.

    What you get:
    • Line-item transactions with quantity, unit_price, discount, location, timestamp
    • Derived metric: total_value per line/basket
    • Formats: CSV and Parquet
    • Docs: data dictionary + daily manifest; optional locked schema
    • Delivery: secure S3 link or private share (Snowflake/Databricks/BigQuery) on request

    Why synthetic?
    • Zero PII and GDPR-friendly experimentation
    • Start modeling today while prod access is pending
    • Stable distributions for benchmarking algorithms and pipelines

    Known limits:
    • Not a drop-in replacement for your production POS/e-com telemetry
    • Aggregate patterns approximate real-world behaviour; individual rows are synthetic

    Sample (2,000 rows) We include a 2,000-row CSV that mirrors the production schema (nulls 0% across sample columns; unique transaction_id; ISO8601 timestamps). Add your own summary stats (AOV, discount rate, category mix) if desired.
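
    As an illustration of the AOV use case, a minimal pandas sketch; the file name is hypothetical and the field names follow the schema described above (transaction_id, quantity, unit_price, discount, total_value, timestamp):

      # Average order value and discount rate from the sample (assumed file name;
      # discount is treated as a per-line monetary amount, which is an assumption).
      import pandas as pd

      df = pd.read_csv("sample_transactions.csv", parse_dates=["timestamp"])

      order_totals = df.groupby("transaction_id")["total_value"].sum()
      print("AOV:", round(order_totals.mean(), 2))

      gross = (df["quantity"] * df["unit_price"]).sum()
      print("Discount rate:", round(df["discount"].sum() / gross, 4))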

    Support & customizations Need region/category mixes, holiday uplift, or promo elasticity scenarios? We can generate tailored cohorts and refresh cadences.

  13. VPN-nonVPN dataset

    • impactcybertrust.org
    Updated Jan 19, 2019
    Cite
    External Data Source (2019). VPN-nonVPN dataset [Dataset]. http://doi.org/10.23721/100/1478793
    Explore at:
    Dataset updated
    Jan 19, 2019
    Authors
    External Data Source
    Description

    To generate a representative dataset of real-world traffic in ISCX we defined a set of tasks, assuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.)

    We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:

    Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice-calls using hangouts, even though browsing is not the main activity, we captured several browsing flows.

    Email: The traffic samples generated using a Thunderbird client, and Alice and Bob Gmail accounts. The clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one client and IMAP/SSL in the other.

    Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called pidgin [14].

    Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo services using Chrome and Firefox.

    File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.

    VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.

    P2P: This label is used to identify file-sharing protocols like Bittorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.

    The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.

    To facilitate the labeling process, when capturing the traffic all unnecessary services and applications were closed. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets with source or destination IP, the address of the local client (Alice or Bob).

    The full research paper outlining the details of the dataset and its underlying principles:

    Gerard Drapper Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy(ICISSP 2016) , pages 407-414, Rome, Italy.
    ISCXFlowMeter has been written in Java for reading the pcap files and creating the CSV file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic, including full packet captures in pcap format and CSV files (flows generated by ISCXFlowMeter), both of which are publicly available for researchers.

    For more information contact cic@unb.ca.

    The UNB ISCX Network Traffic Dataset content
    Traffic: Content
    Web Browsing: Firefox and Chrome
    Email: SMTPS, POP3S and IMAPS
    Chat: ICQ, AIM, Skype, Facebook and Hangouts
    Streaming: Vimeo and Youtube
    File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
    VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
    P2P: uTorrent and Transmission (Bittorrent)

  14. S&P 500 stock data

    • kaggle.com
    zip
    Updated Aug 11, 2017
    Cite
    Cam Nugent (2017). S&P 500 stock data [Dataset]. https://www.kaggle.com/camnugent/sandp500
    Explore at:
    Available download formats: zip (31994392 bytes)
    Dataset updated
    Aug 11, 2017
    Authors
    Cam Nugent
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    Stock market data can be interesting to analyze and as a further incentive, strong predictive models can have large financial payoff. The amount of financial data on the web is seemingly endless. A large and well structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.

    The script I used to acquire all of these .csv files can be found in this GitHub repository. In the future, if you wish for a more up-to-date dataset, it can be used to acquire new versions of the .csv files.

    Content

    The data is presented in a couple of formats to suit different individuals' needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder) and a smaller version of the dataset (all_stocks_1yr.csv) with only the past year's stock data for those wishing to use something more manageable in size.

    The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv and all_stocks_1yr.csv contain this same data, presented in merged .csv files. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.

    All the files have the following columns:
    • Date - in format yy-mm-dd
    • Open - price of the stock at market open (this is NYSE data, so all in USD)
    • High - highest price reached in the day
    • Low - lowest price reached in the day
    • Close - price of the stock at market close
    • Volume - number of shares traded
    • Name - the stock's ticker name

    Acknowledgements

    I scraped this data from Google finance using the python library 'pandas_datareader'. Special thanks to Kaggle, Github and The Market.

    Inspiration

    This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock stats such as volatility and moving averages can be easily calculated. The million-dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
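
    As a concrete example of the moving averages and volatility mentioned above, a minimal pandas sketch using all_stocks_5yr.csv; the column capitalisation follows the description above and may need adjusting to match the actual CSV headers:

      # 20-day moving average and annualised volatility per ticker.
      import numpy as np
      import pandas as pd

      df = pd.read_csv("all_stocks_5yr.csv", parse_dates=["Date"])
      df = df.sort_values(["Name", "Date"])

      # 20-day moving average of the closing price, per ticker
      df["ma20"] = df.groupby("Name")["Close"].transform(lambda s: s.rolling(20).mean())

      # Annualised volatility from daily log returns, per ticker
      returns = df.groupby("Name")["Close"].transform(lambda s: np.log(s).diff())
      volatility = returns.groupby(df["Name"]).std() * np.sqrt(252)
      print(volatility.sort_values(ascending=False).head())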

  15. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for more than 1,000 Amazon products, as listed on the official Amazon website. The data was scraped in January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
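
    Because the file is ordered by class, a minimal pandas sketch of the shuffling step mentioned above, using the reviews/labels columns described for data_rt.csv:

      # Shuffle data_rt.csv before use; the first half is all negative and the
      # second half all positive, so the rows must be mixed.
      import pandas as pd

      df = pd.read_csv("data_rt.csv")
      df = df.sample(frac=1, random_state=42).reset_index(drop=True)
      print(df["labels"].head(10).tolist())  # classes should now be interleaved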

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
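
    The division column was added manually from the ReviewStar score; a possible mapping (the actual cut-offs used for the research are an assumption here) looks like this:

        import pandas as pd

        def star_to_division(star: int) -> str:
            """Hypothetical mapping from a 1-5 star rating to a categorical label."""
            if star >= 4:
                return "positive"
            if star == 3:
                return "neutral"
            return "negative"

        df = pd.read_csv("AllProductReviews2.csv")           # ReviewTitle, ReviewBody, ReviewStar, Product, division
        df["division_check"] = df["ReviewStar"].apply(star_to_division)
        print(df[["ReviewStar", "division_check"]].head())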

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw) and division (manually added categorical label generated using the overall score).
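
    A short sketch (pandas assumed) for loading the edited file and converting the Unix timestamp column to a readable datetime:

        import pandas as pd

        df = pd.read_csv("Musical_instruments_reviews2.csv")
        df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")   # Unix seconds -> datetime
        print(df[["asin", "overall", "reviewDate"]].head())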

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  16. Data_Sheet_1_Integrative Differential Expression Analysis for Multiple...

    • frontiersin.figshare.com
    txt
    Updated May 31, 2023
    Verónica Jiménez-Jacinto; Alejandro Sanchez-Flores; Leticia Vega-Alvarado (2023). Data_Sheet_1_Integrative Differential Expression Analysis for Multiple EXperiments (IDEAMEX): A Web Server Tool for Integrated RNA-Seq Data Analysis.CSV [Dataset]. http://doi.org/10.3389/fgene.2019.00279.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Verónica Jiménez-Jacinto; Alejandro Sanchez-Flores; Leticia Vega-Alvarado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Current DNA sequencing technologies and their high-throughput yields have allowed genomic and transcriptomic experiments to thrive, but they have also created a big-data problem: as sequencing data grow exponentially, so does the complexity of managing, processing, and interpreting them. The demand for easy-to-use, friendly software and websites for running bioinformatic tools is therefore pressing. In particular, RNA-Seq and differential expression analysis have become a popular and useful way to evaluate changes in gene expression in any organism. However, many scientists struggle with the data analysis, since most of the available tools are implemented in a UNIX-based environment. We have therefore developed the web server IDEAMEX (Integrative Differential Expression Analysis for Multiple EXperiments). The IDEAMEX pipeline needs a raw count table for as many replicates and conditions as desired, and it lets the user select which conditions will be compared instead of performing all-vs.-all comparisons. The whole process consists of three main steps: (1) Data Analysis, a preliminary quality-control analysis based on the data distribution per sample, using different types of graphs; (2) Differential Expression, which performs the differential expression analysis, with or without batch-effect awareness, using the Bioconductor packages NOISeq, limma-voom, DESeq2 and edgeR, and generates reports for each method; (3) Result Integration, in which the integrated results are reported using different graphical outputs such as correlograms, heatmaps, Venn diagrams and text lists. Our server allows easy and friendly visualization of results, provides easy interaction during the analysis process, and supports error tracking and debugging through output log files. The server is currently available at http://www.uusmb.unam.mx/ideamex/, where documentation and example input files are provided. We consider that this web server can help researchers with no previous bioinformatic knowledge to perform their analyses in a simple manner.
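
    For orientation, the snippet below builds a small, entirely hypothetical raw count table (genes as rows, one column per condition/replicate), the general kind of input that differential expression tools such as DESeq2 and edgeR expect; the exact format accepted by IDEAMEX is specified in its own documentation and example files.

        import pandas as pd

        # Toy counts: rows are genes, columns are condition/replicate samples.
        counts = pd.DataFrame(
            {
                "control_rep1": [1023, 15, 301],
                "control_rep2": [987, 22, 275],
                "treated_rep1": [402, 140, 295],
                "treated_rep2": [455, 132, 310],
            },
            index=["geneA", "geneB", "geneC"],
        )
        counts.to_csv("raw_counts.csv", index_label="gene_id")
        print(counts)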

  17. CALY-SWE: Discrete choice experiment and time trade-off data for a...

    • researchdata.se
    • data.europa.eu
    Updated Sep 24, 2024
    Kaspar Walter Meili; Lars Lindholm (2024). CALY-SWE: Discrete choice experiment and time trade-off data for a representative Swedish value set [Dataset]. http://doi.org/10.5878/asxy-3p37
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Umeå University
    Authors
    Kaspar Walter Meili; Lars Lindholm
    Time period covered
    Jan 8, 2022 - Apr 18, 2022
    Area covered
    Sweden
    Description

    The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.

    The data is provided in 4 .csv files with the names:

    • tto.csv (252 kB)
    • dce.csv (282 kB)
    • weights_final_model.csv (30 kB)
    • coefs_final_model.csv (1 kB)

    The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weights_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights and the corresponding coefficients of the main-effects additive model.

    Background:

    CALY-SWE is a capability-based instrument for studying quality of life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights) and allows answers for each attribute on 3 levels (Agree, Agree partially, Do not agree). A configuration, or state, is one of the 3^6 = 729 possible situations that the instrument describes. Here, a configuration is denoted in the form xxxxxx, one digit x for each attribute in the order above, where the digit corresponds to the level of the respective attribute, with 3 being the highest (Agree) and 1 being the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Agree partially). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual-level characteristics). A value set consists of values on the 0 to 1 scale for all 729 configurations, each of which represents a quality weighting where 1 is the highest capability-related QoL and 0 the lowest.

    The data contains answers to two types of questions: TTO and DCE.

    In the TTO questions, participants iteratively chose a number of years x between 1 and 10, such that living x years with full capability (state configuration 333333) is equivalent to living 10 years in the capability state that the TTO question describes; the answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were shown two states and chose the one they found better. We used a hybrid model with a linear regression component and a logit model component, whose coefficients were linked through a multiplicative factor, to obtain the weights (weights_final_model.csv). Each weight is calculated as the constant plus the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (no coefficients for health, occupation, and security are involved, as they are on level 1, which is captured in the constant/intercept).
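
    A minimal sketch of this additive rule, using hypothetical coefficient values (the real values are in coefs_final_model.csv):

        # Illustrative coefficient values only; real values come from coefs_final_model.csv.
        coefs = {"constant": 0.40, "socrel2": 0.05, "finhou2": 0.06,
                 "finhou3": 0.04, "polciv2": 0.03}

        ATTRS = ["health", "socrel", "finhou", "occu", "secu", "polciv"]

        def weight(config: str) -> float:
            """Constant plus the level-2 (and, for level 3, also level-3) coefficients."""
            w = coefs["constant"]
            for attr, level in zip(ATTRS, config):
                if level in ("2", "3"):
                    w += coefs.get(f"{attr}2", 0.0)   # level 2 adds its increment over the constant
                if level == "3":
                    w += coefs.get(f"{attr}3", 0.0)   # level 3 adds a further increment over level 2
            return w

        print(round(weight("123112"), 3))             # constant + socrel2 + finhou2 + finhou3 + polciv2 = 0.58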

    To assess the quality of the TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO questions. We then excluded the 20% of participants with the worst scores to improve TTO data quality and the signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check that document for updated versions.

    IDs have been randomized, with linkage between the DCE and TTO datasets preserved.

    Data files and variables:

    Below is a description of the variables in each CSV file.

    • tto.csv:

    config: 6 numbers representing the attribute levels.
    position: The number of the asked TTO question.
    tto_block: The design block of the TTO question.
    answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1.
    included: Whether the answer was included in the data used by the model to generate the value set.
    id: Randomized id of the participant.

    • dce.csv:

    config1: Configuration of the first state in the question.
    config2: Configuration of the second state in the question.
    position: The number of the asked DCE question.
    answer: Whether state 1 or 2 was preferred.
    id: Randomized id of the participant.

    • weights_final_model.csv:

    config: 6 numbers representing the attribute levels.
    weight: The weight calculated with the final model.
    ciu: The upper 95% credible interval.
    cil: The lower 95% credible interval.

    • coefs_final_model.csv:

    name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above: health, socrel, finhou, occu, secu, polciv).
    value: Continuous, weight on the 0 to 1 scale.
    ciu: The upper 95% credible interval.
    cil: The lower 95% credible interval.

  18. Data_Sheet_1_Development and External Validation of a Dynamic Nomogram With...

    • frontiersin.figshare.com
    txt
    Updated Jun 4, 2023
    TingTing Chen; WeiGen Xiong; ZhiHong Zhao; YaJie Shan; XueMei Li; LeHeng Guo; Lan Xiang; Dong Chu; HongWei Fan; YingBin Li; JianJun Zou (2023). Data_Sheet_1_Development and External Validation of a Dynamic Nomogram With Potential for Risk Assessment of Ruptured Multiple Intracranial Aneurysms.CSV [Dataset]. http://doi.org/10.3389/fneur.2022.797709.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    TingTing Chen; WeiGen Xiong; ZhiHong Zhao; YaJie Shan; XueMei Li; LeHeng Guo; Lan Xiang; Dong Chu; HongWei Fan; YingBin Li; JianJun Zou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background and Purpose: About 20.1% of intracranial aneurysm (IA) carriers are patients with multiple intracranial aneurysms (MIAs), who face higher rupture risk and worse prognosis. A prediction model may bring some potential benefits. This study attempted to develop and externally validate a dynamic nomogram to assess the rupture risk of each IA among patients with MIAs. Method: We retrospectively analyzed the data of 262 patients with 611 IAs admitted to the Hunan Provincial People's Hospital between November 2015 and November 2021. Multivariable logistic regression (MLR) was applied to select the risk factors and derive a nomogram model for the assessment of IA rupture risk in MIA patients. To externally validate the nomogram, data from 35 patients with 78 IAs were collected from another independent center between December 2009 and May 2021. The performance of the nomogram was assessed in terms of discrimination, calibration, and clinical utility. Results: Size, location, irregular shape, diabetes history, and neck width were independently associated with IA rupture. The nomogram showed good discriminative ability for ruptured and unruptured IAs in the derivation cohort (AUC = 0.81; 95% CI, 0.774–0.847) and generalized successfully to the external validation cohort (AUC = 0.744; 95% CI, 0.627–0.862). The nomogram was well calibrated, and decision curve analysis showed that it would generate more net benefit in identifying IA rupture than the “treat all” or “treat none” strategies at threshold probabilities ranging from 10 to 60% in both the derivation and external validation sets. The web-based dynamic nomogram calculator is accessible at https://wfs666.shinyapps.io/onlinecalculator/. Conclusion: External validation showed that the model has the potential to assist clinical identification of dangerous aneurysms after longitudinal data evaluation. Size, neck width, and location are the primary risk factors for ruptured IAs.
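
    The sketch below illustrates the general multivariable-logistic-regression-plus-AUC workflow on synthetic data; it is not the authors' model, and the five predictors are placeholders standing in for size, location, shape, diabetes history, and neck width.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.normal(size=(600, 5))                                          # placeholder predictors
        y = (X[:, 0] + 0.5 * X[:, 4] + rng.normal(size=600) > 0).astype(int)   # synthetic rupture labels

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        model = LogisticRegression().fit(X_tr, y_tr)
        print("AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))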

  19. Data from: Current and projected research data storage needs of Agricultural...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    Agricultural Research Service (2025). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. https://catalog.data.gov/dataset/current-and-projected-research-data-storage-needs-of-agricultural-research-service-researc-f33da
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of the data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey, which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to open it to all members of their unit, or to collate responses from their unit themselves before reporting in the survey. Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," the 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data is considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response (a small sketch of this calculation follows the resource list below). For Big Data users we used the actual reported values or estimated likely values.

    Resources in this dataset:

    • Resource Title: Appendix A: ARS data storage survey questions.
      File Name: Appendix A.pdf
      Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here.
      Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

    • Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
      File Name: Machine-readable survey response data.csv
      Resource Description: CSV file that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

    • Resource Title: Responses from ARS Researcher Data Storage Survey.
      File Name: Data Storage Survey Data for public release.xlsx
      Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
      Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
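
    A minimal sketch of the per-person calculation described above (high end of the reported storage range divided by the number of respondents the response covers):

        def per_person_storage_tb(range_high_tb: float, group_size: int = 1) -> float:
            """High end of the reported storage range divided by 1 (individual) or G (group)."""
            return range_high_tb / max(group_size, 1)

        print(per_person_storage_tb(100.0))                  # individual response
        print(per_person_storage_tb(100.0, group_size=8))    # group response covering 8 scientists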

  20. Data from: Public Notification

    • anla-esp-esri-co.hub.arcgis.com
    • city-of-lawrenceville-arcgis-hub-lville.hub.arcgis.com
    Updated Feb 23, 2023
    esri_en (2023). Public Notification [Dataset]. https://anla-esp-esri-co.hub.arcgis.com/items/4799f2bd805c4d7bab4cfc90c96e1c38
    Explore at:
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    esri_en
    Description

    Use the Public Notification template to allow users to create a list of features that they can export to a CSV or PDF file. They can create lists of features by searching for a single location, drawing an area of interest to include intersecting features, or using the geometry of an existing feature as the area of interest. Users can include a search buffer around a defined area of interest to expand the list of features.

    Examples:

    • Export a CSV file with addresses for residents to alert about road closures.
    • Create an inventory of contact information for parents in a school district.
    • Generate a PDF file to print address labels and the corresponding map for community members.

    Data requirements: The Public Notification template requires a feature layer to use all of its capabilities.

    Key app capabilities:

    • Search radius - Define a distance for a search buffer that selects intersecting input features to include in the list (see the conceptual sketch after this description).
    • Export - Save the results from the lists created in the app. Users can export the data to CSV and PDF format.
    • Refine selection - Allow users to revise the selected features in the lists they create by adding or removing features with sketch tools.
    • Sketch tools - Draw graphics on the map to select features to add to a list. Users can also use features from a layer to select intersecting features from the input layer.
    • Home, Zoom controls, Legend, Layer List, Search

    Supportability: This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.
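
    A conceptual sketch of the buffer-and-select step (using shapely rather than the Esri template itself; the feature names and coordinates are made up for illustration):

        from shapely.geometry import Point

        # Hypothetical features from an input feature layer.
        features = {
            "parcel_1": Point(2.0, 1.0),
            "parcel_2": Point(10.0, 10.0),
            "parcel_3": Point(3.5, 0.5),
        }

        area_of_interest = Point(2.5, 1.0).buffer(2.0)       # searched location with a 2-unit search radius

        # Keep the features that intersect the buffered area of interest.
        selected = [name for name, geom in features.items() if area_of_interest.intersects(geom)]
        print(selected)                                      # ['parcel_1', 'parcel_3']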
