100+ datasets found
  1. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1more
    Updated Oct 13, 2014
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Explore at:
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIROhttp://www.csiro.au/
    License

    https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIROhttp://www.csiro.au/
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  2. GitTables 1M - CSV files

    • zenodo.org
    zip
    Updated Jun 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  3. Adventure Works 2022 CSVs

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Algorismus (2022). Adventure Works 2022 CSVs [Dataset]. https://www.kaggle.com/datasets/algorismus/adventure-works-in-excel-tables
    Explore at:
    zip(567646 bytes)Available download formats
    Dataset updated
    Nov 2, 2022
    Authors
    Algorismus
    License

    http://www.gnu.org/licenses/lgpl-3.0.htmlhttp://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Adventure Works 2022 dataset

    How this Dataset is created?

    On the official website the dataset is available over SQL server (localhost) and CSVs to be used via Power BI Desktop running on Virtual Lab (Virtaul Machine). As per first two steps of Importing data are executed in the virtual lab and then resultant Power BI tables are copied in CSVs. Added records till year 2022 as required.

    How this Dataset may help you?

    this dataset will be helpful in case you want to work offline with Adventure Works data in Power BI desktop in order to carry lab instructions as per training material on official website. The dataset is useful in case you want to work on Power BI desktop Sales Analysis example from Microsoft website PL 300 learning.

    How to use this Dataset?

    Download the CSV file(s) and import in Power BI desktop as tables. The CSVs are named as tables created after first two steps of importing data as mentioned in the PL-300 Microsoft Power BI Data Analyst exam lab.

  4. B

    Residential School Locations Dataset (CSV Format)

    • borealisdata.ca
    • search.dataone.org
    Updated Jun 5, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Borealis
    Authors
    Rosa Orlandini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Area covered
    Canada
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRRSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconcilation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative, and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its’ original location to another property, then the school is considered to have two unique locations in this dataset,the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.

  5. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  6. github-final-datasets

    • kaggle.com
    zip
    Updated Nov 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Olga Ivanova (2023). github-final-datasets [Dataset]. https://www.kaggle.com/datasets/olgaiv39/github-final-datasets
    Explore at:
    zip(1877861953 bytes)Available download formats
    Dataset updated
    Nov 9, 2023
    Authors
    Olga Ivanova
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Github Clean Code Snippets Dataset

    Here is a description, how the datasets for a training notebook used for Telegram ML Contest solution were prepared.

    1 Step - Github Samples Database parsing

    The first part of the code samples was taken from a private version of this notebook.

    Here is the statistics about classes of programming languages from Github Code Snippets database https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F2fdc091661198e80559f8cb1d1a306ff%2FScreenshot%202023-11-07%20at%2021.24.42.png?generation=1699390166413391&alt=media" alt="">

    From this database, 2 csv files were created - with 50000 code samples for each of the 20 programming languages included, with equal by numbers and stratified sampling. The files related here are sample_equal_prop_50000.csv and sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.

    2 Step - Github Bigquery Database parsing

    Second option for capturing out additional examples was to run this notebook with making up larger amount of queries, 10000.

    The resulted file is dataset-10000.csv - included to the data card

    The statistics for the code programming languages is as on the next chart - it has 32 labeled classes
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F7c04342da8ec1df266cd90daf00204f9%2FScreenshot%202023-10-13%20at%2020.52.13.png?generation=1699392769199533&alt=media" alt="">

    3 Step - collection of code samples of raw coding samples

    To get a model more robust, code samples of 20 additional languages were collected in amount from 10 till 15 samples on more-less popular use cases. Also, for the class "OTHER", like regular language examples, according to the task of the competition, the text examples from this dataset with promts on Huggingface were added to the file. The resulted file here is rare_languages.csv - also in data card

    The statistics for rare languages code snippets is as follows: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F0b340781c774d2acb988ce1567f4afa3%2FScreenshot%202023-11-08%20at%2001.13.07.png?generation=1699402436798661&alt=media" alt="">

    4 Step - First and second datasets combining

    For this stage of dataset creation, the number of the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv was cut out just for 2 - "snippet", "language", the version of file with equal numbers is in the data card - sample_equal_prop_50000_clean.csv

    To prepare Bigquery dataset file, the column with index was cut out, and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv

    After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined together and saved as github-combined-file.csv

    5 Step - Datasets cleaning from symbols and merging together with rare languages

    The prepared files took too much RAM to be read by Pandas library, so that is why additional prepocessing has been made - the symbols like quatas, commas, ampersands, new lines and adding tabs characters were cleaned out. After clieaning, the flies were merged with rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.

    The final distribution of classes turned out to be the next one https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2Ff43e0cea4c565c9f7c808527b0dfa2da%2FScreenshot%202023-11-09%20at%2020.26.30.png?generation=1699558064765454&alt=media" alt="">

    6 Step - Fixing up the labels

    To be suitable for TF-DF format, to each programming language a certain label was given as well. The final labels are in the data card.

  7. _labels1.csv. This data set representss the label of the corresponding...

    • figshare.com
    txt
    Updated Oct 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    naillah gul (2023). _labels1.csv. This data set representss the label of the corresponding samples in data.csv file [Dataset]. http://doi.org/10.6084/m9.figshare.24270088.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    naillah gul
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain pixel-level hyperspectral data of six snow and glacier classes. They have been extracted from a Hyperspectral image. The dataset "data.csv" has 5417 * 142 samples belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" has corresponding labels of the "data.csv" file. The dataset "RGB.csv" has only 5417 * 3 samples. There are only three band values in this file while "data.csv" has 142 band values.

  8. 1000 Empirical Time series

    • figshare.com
    • bridges.monash.edu
    • +1more
    png
    Updated May 30, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Explore at:
    pngAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Ben Fulcher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv. These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as>> TS_Init('INP_Empirical1000.mat');Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.

  9. D

    Walmart data in CSV format

    • dataandsons.com
    csv, zip
    Updated Aug 15, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    crawl feeds (2022). Walmart data in CSV format [Dataset]. https://www.dataandsons.com/categories/product-lists/walmart-data-in-csv-format
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Aug 15, 2022
    Dataset provided by
    Data & Sons
    Authors
    crawl feeds
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Aug 15, 2022
    Description

    About this Dataset

    Walmart data in CSV format extracted by crawl feeds team using in-house tools. Last extracted on 15 Aug 2022.

    Category

    Product Lists

    Keywords

    Walmart dataset,retail datasets,ecommerce datasets

    Row Count

    10

    Price

    Free

  10. m

    Ransomware and user samples for training and validating ML models

    • data.mendeley.com
    Updated Sep 17, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
    Explore at:
    Dataset updated
    Sep 17, 2021
    Authors
    Eduardo Berrueta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware is considered as a significant threat for most enterprises since past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

    This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.

    The folders are named NxSy where x is the number of 1-second interval per sample and y the sliding step in seconds.

    Each folder (for example N10S10/) contains: - tree.py -> Python script with the Tree model. - ensemble.json -> JSON file with the information about the Ensemble model. - NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3). - N10S10.csv -> All samples used for training each model in this folder. It is in csv format for using in bigML application. - zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for using in bigML application. - userSamples_test -> All samples used for validating each model in this folder. It is in csv format for using in bigML application. - userSamples_train -> User samples used for training the models. - ransomware_train -> Ransomware samples used for training the models - scaler.scaler -> Standard Scaler from python library used for scale the samples. - zeroDays_notFiltered -> Folder with the zeroDay samples.

    In the case of N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but it is because some of them are not "unseen" binaries (the families are present in the training set).

    The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows: - Each line is one sample. - Each sample has 3*T features and the label (1 if it is 'infected' sample and 0 if it is not). - The features are separated by ',' because it is a csv file. - The last column is the label of the sample.

    Additionally we have placed two pcap files in root directory. There are the traces used for compare both versions of SMB.

  11. Cleaned Contoso Dataset

    • kaggle.com
    zip
    Updated Aug 27, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bhanu (2023). Cleaned Contoso Dataset [Dataset]. https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset
    Explore at:
    zip(487695063 bytes)Available download formats
    Dataset updated
    Aug 27, 2023
    Authors
    Bhanu
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. Jupyter Notebook containing the code used to clean the data can be found here

    Version 6 has a some more cleaning and structuring that was noticed after importing in Power BI. Changes were made by adding code in python notebook to export new cleaned dataset, such as adding MonthNumber for sorting by month number, similar for WeekDayNumber.

    Cleaning was done in python while also using SQL Server to quickly find things. Headers were added separately, ensuring no data loss.Data was cleaned for NaN, garbage values and other columns.

  12. Event Logs CSV

    • figshare.com
    rar
    Updated Dec 9, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dina Bayomie (2019). Event Logs CSV [Dataset]. http://doi.org/10.6084/m9.figshare.11342063.v1
    Explore at:
    rarAvailable download formats
    Dataset updated
    Dec 9, 2019
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Dina Bayomie
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The event logs in CSV format. The dataset contains both correlated and uncorrelated logs

  13. Dataset #1: Cross-sectional survey data

    • figshare.com
    txt
    Updated Jul 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adam Baimel (2023). Dataset #1: Cross-sectional survey data [Dataset]. http://doi.org/10.6084/m9.figshare.23708730.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Adam Baimel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    N.B. This is not real data. Only here for an example for project templates.

    Project Title: Add title here

    Project Team: Add contact information for research project team members

    Summary: Provide a descriptive summary of the nature of your research project and its aims/focal research questions.

    Relevant publications/outputs: When available, add links to the related publications/outputs from this data.

    Data availability statement: If your data is not linked on figshare directly, provide links to where it is being hosted here (i.e., Open Science Framework, Github, etc.). If your data is not going to be made publicly available, please provide details here as to the conditions under which interested individuals could gain access to the data and how to go about doing so.

    Data collection details: 1. When was your data collected? 2. How were your participants sampled/recruited?

    Sample information: How many and who are your participants? Demographic summaries are helpful additions to this section.

    Research Project Materials: What materials are necessary to fully reproduce your the contents of your dataset? Include a list of all relevant materials (e.g., surveys, interview questions) with a brief description of what is included in each file that should be uploaded alongside your datasets.

    List of relevant datafile(s): If your project produces data that cannot be contained in a single file, list the names of each of the files here with a brief description of what parts of your research project each file is related to.

    Data codebook: What is in each column of your dataset? Provide variable names as they are encoded in your data files, verbatim question associated with each response, response options, details of any post-collection coding that has been done on the raw-response (and whether that's encoded in a separate column).

    Examples available at: https://www.thearda.com/data-archive?fid=PEWMU17 https://www.thearda.com/data-archive?fid=RELLAND14

  14. train csv file

    • kaggle.com
    zip
    Updated May 5, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Emmanuel Arias (2018). train csv file [Dataset]. https://www.kaggle.com/datasets/eamanu/train
    Explore at:
    zip(33695 bytes)Available download formats
    Dataset updated
    May 5, 2018
    Authors
    Emmanuel Arias
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Emmanuel Arias

    Released under Database: Open Database, Contents: Database Contents

    Contents

  15. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  16. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 17, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present a dataset, MNIST4OD, of large size (number of dimensions and number of instances) suitable for Outliers Detection task.The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/).We build MNIST4OD in the following way:To distinguish between outliers and inliers, we choose the images belonging to a digit as inliers (e.g. digit 1) and we sample with uniform probability on the remaining images as outliers such as their number is equal to 10% of that of inliers. We repeat this dataset generation process for all digits. For implementation simplicity we then flatten the images (28 X 28) into vectors.Each file MNIST_x.csv.gz contains the corresponding dataset where the inlier class is equal to x.The data contains one instance (vector) in each line where the last column represents the outlier label (yes/no) of the data point. The data contains also a column which indicates the original image class (0-9).See the following numbers for a complete list of the statistics of each datasets ( Name | Instances | Dimensions | Number of Outliers in % ):MNIST_0 | 7594 | 784 | 10MNIST_1 | 8665 | 784 | 10MNIST_2 | 7689 | 784 | 10MNIST_3 | 7856 | 784 | 10MNIST_4 | 7507 | 784 | 10MNIST_5 | 6945 | 784 | 10MNIST_6 | 7564 | 784 | 10MNIST_7 | 8023 | 784 | 10MNIST_8 | 7508 | 784 | 10MNIST_9 | 7654 | 784 | 10

  17. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Vaccine Trials Networkhttp://www.hvtn.org/
    HIV Prevention Trials Networkhttp://www.hptn.org/
    National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

  18. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Explore at:
    zip(9789538 bytes)Available download formats
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
  19. US Real Estate

    • zenrows.com
    csv
    Updated Jun 27, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ZenRows (2021). US Real Estate [Dataset]. https://www.zenrows.com/datasets/us-real-estate
    Explore at:
    csv(5,8MB)Available download formats
    Dataset updated
    Jun 27, 2021
    Dataset provided by
    ZenRows S.L.
    Authors
    ZenRows
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    High-quality, free real estate dataset from all around the United States, in CSV format. Over 10.000 records relevant to Real Estate investors, agents, and data scientists. We are working on complete datasets from a wide variety of countries. Don't hesitate to contact us for more information.

  20. Physiological signals during activities for daily life: Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Mar 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro (2022). Physiological signals during activities for daily life: Dataset [Dataset]. http://doi.org/10.5281/zenodo.6391454
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Eduardo Gutierrez Maestro; Eduardo Gutierrez Maestro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this work is composed by four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total number of 15 days. They carried the wearable during the day, and during the nights we asked participants to charge and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires which are used to label our data. However, some participants could not complete the full experiment or some days were discarded due to data corruption. Specific demographic information, total sampling days and total number of EMA answers can be found in table I.

    Participant 1Participant 2Participant 3Participant 4
    Age67556063
    GenderMaleFemaleMaleFemale

    Final Valid Days

    9151213
    Total EMAs42576446

    Table I. Summary of participants' collected data.

    This dataset provides three different type of labels. Activeness and happiness are two of these labels. These are the answers to EMA questionnaires that participants reported during their daily activities. These labels are numbers between 0 and 4.
    These labels are used to interpolate the mental well-being state according to [1] We report in our dataset a total number of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.

    The data we provide in this repository consist of two type of files:

    • CSV files: These files contain physiological signals recorded during the data collection process. The first line of each CSV file defines the timestamp by which data started being sampled. The second line defines the sampling frequency used for gathering the signal. From the third line until the end of the file, one can find sampled datapoints.
    • Excel files: These files contain the labels obtained from EMA answers. It is indicated the timestamp at which the answer was registered. Labels for pleasure, activeness and mood can be found in this file.

    NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the signal ACC for sampling day 1. The same applied to excel files.

    Code and a tutorial of how to labelled and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction

    References:

    [1] . A. Russell, “A circumplex model of affect,” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Organization logo

CSV file used in statistical analyses

Explore at:
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttp://www.csiro.au/
License

https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/

Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttp://www.csiro.au/
Description

A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

Search
Clear search
Close search
Google apps
Main menu