53 datasets found
  1. ‘Indian Names Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 10, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Indian Names Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-indian-names-dataset-65ca/latest
    Explore at:
    Dataset updated
    Aug 10, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Indian Names Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/ananysharma/indian-names-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset grew out of a project I was working on: extracting names from unstructured text, which I am still working on. I am sharing it because some people might find it useful for Named Entity Recognition and other NLP tasks. If you want, you can work on extracting names from unstructured text without any context, for example from a document where no surrounding context is present. You can share your work and we can work together to improve it.

    Content

    The dataset contains a male and a female names file, along with a Python preprocessing script for merging the two. You can use either file on its own, or see how the two can be merged.
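
    As a rough illustration of the merging step mentioned above (not the bundled preprocessing script), the sketch below concatenates the two files with pandas; the file names and columns are assumptions, since the exact schema is not described here.

    # Hypothetical sketch: merge the male and female name files into one table.
    # File names and the "name" column are assumptions; adjust to the actual files.
    import pandas as pd

    male = pd.read_csv("male_names.csv")
    female = pd.read_csv("female_names.csv")

    male["gender"] = "male"      # tag each record with its source file
    female["gender"] = "female"

    # Concatenate, drop exact duplicates, and save the combined dataset
    names = pd.concat([male, female], ignore_index=True).drop_duplicates()
    names.to_csv("indian_names_combined.csv", index=False)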

    Acknowledgements

    I got to know about this dataset from a GitHub repository, which can be visited here.

    --- Original source retains full ownership of the source dataset ---

  2. Data from: Mining Rule Violations in JavaScript Code Snippets

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 24, 2020
    Cite
    Smethurst, Guilherme (2020). Mining Rule Violations in JavaScript Code Snippets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593817
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Moraes, João Pedro
    Smethurst, Guilherme
    Pinto, Gustavo
    Ferreira Campos, Uriel
    Bonifácio, Rodrigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content of this repository: the scripts and dataset for the MSR 2019 mining challenge.

    GitHub repository with the software used: here.

    DATASET: The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. This original, untreated file is called jsanswers.csv, and in it we can find the following information:
    1. The ID of the question (PostId)
    2. The content (in this case, the code block)
    3. The length of the code block
    4. The line count of the code block
    5. The score of the post
    6. The title

    A quick look at this file shows that a PostId can have multiple rows related to it; that is how multiple code blocks are saved in the database.

    Filtered Dataset:

    Extracting code from CSV: We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into one JavaScript file per post, named after the PostId. This resulted in 336 thousand files.
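
    As a rough illustration of this step (not the authors' ExtractCodeFromCSV.py), the sketch below groups the code blocks by PostId and writes one JavaScript file per post; the column names follow the list above and may differ in the actual CSV.

    # Hypothetical sketch of the extraction step: group code blocks by PostId and
    # write one JavaScript file per post. Column names are assumptions based on
    # the description above (PostId, Content).
    import os
    import pandas as pd

    df = pd.read_csv("jsanswers.csv")
    os.makedirs("js_files", exist_ok=True)

    for post_id, group in df.groupby("PostId"):
        # A post can have several code blocks; concatenate them in order.
        code = "\n\n".join(group["Content"].astype(str))
        with open(os.path.join("js_files", f"{post_id}.js"), "w", encoding="utf-8") as f:
            f.write(code)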

    Running ESLint: Because ESLint is single-threaded, running it over 336 thousand files took a huge toll on the machine, so we created a script named "ESlintRunnerScript.py". It splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports, producing 20 JSON files.
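
    A minimal sketch of the split-and-run idea (not the authors' ESlintRunnerScript.py): divide the files into 20 evenly distributed chunks and launch one ESLint process per chunk, each writing its own JSON report. It assumes eslint is installed and on the PATH.

    # Hypothetical sketch of running ESLint over many files in 20 parallel chunks.
    # Paths and chunk count are illustrative.
    import glob
    import subprocess

    files = sorted(glob.glob("js_files/*.js"))
    n_chunks = 20
    chunks = [files[i::n_chunks] for i in range(n_chunks)]  # evenly distributed

    procs = []
    for i, chunk in enumerate(chunks):
        if not chunk:
            continue
        cmd = ["eslint", "--format", "json", "--output-file", f"report_{i}.json"] + chunk
        procs.append(subprocess.Popen(cmd))

    for p in procs:
        p.wait()  # one JSON report per chunk, 20 in total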

    Number of violations per rule: This information was extracted using the script named "parser.py", which generated the file "NumberofViolationsPerRule.csv" containing the number of violations for each rule used in the linter configuration.

    Number of violations per category: To provide summary statistics for the dataset, we also generated the number of violations per rule category, as defined on the ESLint website, using the same "parser.py" script.

    Individual reports: This information was extracted from the JSON reports; it is a CSV file with the PostId and the violations per rule.

    Rules: The file "Rules with categories" contains all the rules used and their categories.

  3. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations, and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software and the repositories using it help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account, and another column named "apikey" listing my accounts' API tokens. The Python script reads the CSV file and uses the listed API tokens to get metadata and other information from installations that require them.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │ ├── author(citation)_2023.08.22-2023.08.28.csv
    │ ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │ ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │ ├── ...
    │ └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │ ├── Abacus_2023.08.27_12.59.59.zip
    │ ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │ ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │ ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │ ├── ...
    │ ├── metadatablocks_v5.6
    │ ├── astrophysics_v5.6.json
    │ ├── biomedical_v5.6.json
    │ ├── citation_v5.6.json
    │ ├── ...
    │ ├── socialscience_v5.6.json
    │ ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │ ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │ ├── Arca_Dados_2023.08.27_13.34.09.zip
    │ ├── ...
    │ └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    └── dataverse_installations_summary_2023.08.28.csv
    └── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    └── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for author names, affiliations, identifier types and identifiers.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column to indicate whether the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name, which should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded; I included them so that they can be used when extracting metadata from the datasets' Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
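
    The linked script handles the full download; purely as an illustration of the CSV-plus-API-token idea described above, here is a minimal Python sketch (not the author's script) that reads such a CSV and requests one dataset's Dataverse JSON export per installation. The persistent identifier shown is a placeholder.

    # Minimal sketch (not the linked script): read the hostname/apikey CSV and
    # request a dataset's "dataverse_json" export from each installation.
    import csv
    import requests

    def export_dataset_json(hostname, api_key, persistent_id):
        """Fetch the Dataverse JSON export of one dataset from one installation."""
        url = f"{hostname.rstrip('/')}/api/datasets/export"
        params = {"exporter": "dataverse_json", "persistentId": persistent_id}
        headers = {"X-Dataverse-key": api_key} if api_key else {}
        resp = requests.get(url, params=params, headers=headers, timeout=60)
        resp.raise_for_status()
        return resp.json()

    with open("installations.csv", newline="") as f:   # columns: hostname, apikey
        for row in csv.DictReader(f):
            # 'doi:10.7910/DVN/EXAMPLE' is a placeholder persistent identifier
            metadata = export_dataset_json(row["hostname"], row["apikey"], "doi:10.7910/DVN/EXAMPLE")
            print(metadata.get("datasetVersion", {}).get("versionNumber"))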

  4. Pharmacogenomics Datasets for Cancer Cell Lines from CellMiner...

    • zenodo.org
    application/gzip
    Updated May 20, 2025
    Cite
    Augustin Luna; Fathi Elloumi; Vinodh Rajapakse (2025). Pharmacogenomics Datasets for Cancer Cell Lines from CellMiner Cross-Database (CellMinerCDB) [Dataset]. http://doi.org/10.5281/zenodo.15122311
    Explore at:
    application/gzip
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Augustin Luna; Fathi Elloumi; Vinodh Rajapakse
    License

    GNU Lesser General Public License v3.0: https://www.gnu.org/licenses/lgpl-3.0-standalone.html

    Description

    Cell line pharmacogenomics datasets for cancer biology and machine learning studies. The datasets are compatible with rcellminer and CellMinerCDB (see publications for details) and data can be extracted for use with Python-based projects.

    An example of extracting data from the rcellminer- and CellMinerCDB-compatible packages:

    # INSTALL ----
    if (!require("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    
    BiocManager::install("rcellminer")
    
    # Replace path_to_file with the data package filename
    install.packages(path_to_file, repos = NULL, type="source")
    
    # GET DATA ----
    ## Replace nciSarcomaData with name of dataset through code 
    library(nciSarcomaData)
    
    ## DRUG DATA ----
    drugAct <- exprs(getAct(nciSarcomaData::drugData))
    drugAnnot <- getFeatureAnnot(nciSarcomaData::drugData)[["drug"]]
    
    ## MOLECULAR DATA ----
    ### List available datasets
    names(getAllFeatureData(nciSarcomaData::molData))
    
    ### Extract data and annotations
    expData <- exprs(nciSarcomaData::molData[["exp"]])
    mirData <- exprs(nciSarcomaData::molData[["mir"]])
    
    expAnnot <- getFeatureAnnot(nciSarcomaData::molData)[["exp"]]
    mirAnnot <- getFeatureAnnot(nciSarcomaData::molData)[["mir"]]
    
    ## SAMPLE DATA ----
    sampleAnnot <- getSampleData(nciSarcomaData::molData)

  5. Build better LibGuides: A dataset of Political Science, Public Affairs, and...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated May 30, 2024
    Cite
    Annelise Sklar (2024). Build better LibGuides: A dataset of Political Science, Public Affairs, and International Studies LibGuides [Dataset]. http://doi.org/10.5061/dryad.prr4xgxvk
    Explore at:
    zip
    Dataset updated
    May 30, 2024
    Dataset provided by
    University of California, San Diego
    Authors
    Annelise Sklar
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The dataset that accompanies the "Build Better LibGuides" chapter of Teaching Information Literacy in Political Science, Public Affairs, and International Studies. This dataset was created to compare current practices in Political Science, Public Affairs, and International Studies (PSPAIS) LibGuides with recommended best practices using a sample that represents a variety of academic institutions. Members of the ACRL Politics, Policy, and International Relations Section (PPIRS) were identified as the librarians most likely to be actively engaged with these specific subjects, so the dataset was scoped by identifying the institutions associated with the most active PPIRS members and then locating the LibGuides in these and related disciplines. The resulting dataset includes 101 guides at 46 institutions, for a total of 887 LibGuide tabs.

    Methods

    This dataset was created to compare current practices in PSPAIS LibGuides with recommended best practices using a sample that represents a variety of academic institutions. Specifically, a student assistant collected the names and institutional affiliations of each member serving on a PPIRS committee as of July 1, 2021, 2022, and 2023. The student then removed the individual librarian names from the list and located the links to the Political Science or Government; Public Policy, Public Affairs, or Public Administration; and International Studies or International Relations LibGuides at each institution. The chapter author then confirmed and, in a few cases, added to the student's work and copied and pasted the tab names from each guide (which conveniently were also hyperlinked) into a Google Sheet. The resulting dataset included 101 guides at 46 institutions, for a total of 887 LibGuide tabs. A Google Apps script was used to extract the hyperlinks from the collected tab names, and then a Python script was used to scrape the names of links included on each of the tabs. LibGuides from two institutions returned errors during the link-name scraping process and were excluded from this part of the analysis.
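
    The chapter text above mentions a Python script for scraping link names from each tab. As a rough illustration only (not the author's code), the following sketch reads a list of tab URLs and collects the visible link names with requests and BeautifulSoup; the file names and generic selectors are assumptions.

    # Hypothetical sketch of scraping link names from collected LibGuide tab URLs.
    # Assumes a plain-text file of tab URLs; selectors are generic, not LibGuides-specific.
    import csv
    import requests
    from bs4 import BeautifulSoup

    with open("tab_urls.txt") as f:
        tab_urls = [line.strip() for line in f if line.strip()]

    rows = []
    for url in tab_urls:
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException as err:
            print(f"Skipping {url}: {err}")   # some guides may return errors
            continue
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a"):
            text = a.get_text(strip=True)
            if text:
                rows.append((url, text))

    with open("link_names.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["tab_url", "link_name"])
        writer.writerows(rows)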

  6. Python Import Data India – Buyers & Importers List

    • seair.co.in
    Cite
    Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
    Explore at:
    .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    India
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  7. Twitter Public Sentiment Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Twitter Public Sentiment Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/04ea3224-1b10-48d4-871a-496c9a2633ff
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Telecommunications & Network Data
    Description

    This dataset provides a collection of 1000 tweets designed for sentiment analysis. The tweets were sourced from Twitter using Python and systematically generated using various modules to ensure a balanced representation of different tweet types, user behaviours, and sentiments. This includes the use of a random module for IDs and text, a faker module for usernames and dates, and a textblob module for assigning sentiment. The dataset's purpose is to offer a robust foundation for analysing and visualising sentiment trends and patterns, aiding in the initial exploration of data and the identification of significant patterns or trends.
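
    As a rough illustration of the generation approach described above (random for IDs and numbers, faker for usernames and dates, textblob for sentiment), here is a minimal sketch; the sample texts and value ranges are assumptions, not the actual generation code.

    # Hypothetical sketch of generating one synthetic tweet record with the modules
    # named above; sample texts and value ranges are illustrative assumptions.
    import random
    from faker import Faker
    from textblob import TextBlob

    fake = Faker()
    sample_texts = [
        "Loving the new update, works great!",
        "Service has been down all morning, very frustrating.",
        "Just switched providers, no strong opinion yet.",
    ]

    def make_tweet(tweet_id):
        text = random.choice(sample_texts)
        polarity = TextBlob(text).sentiment.polarity          # -1.0 .. 1.0
        sentiment = ("positive" if polarity > 0.1
                     else "negative" if polarity < -0.1 else "neutral")
        return {
            "Tweet ID": tweet_id,
            "Text": text,
            "User": fake.user_name(),
            "Created At": fake.date_time_between(start_date="-2y", end_date="now"),
            "Likes": random.randint(0, 500),
            "Retweets": random.randint(0, 100),
            "Sentiment": sentiment,
        }

    tweets = [make_tweet(i) for i in range(1000)]
    print(tweets[0])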

    Columns

    • Tweet ID: A unique identifier assigned to each individual tweet.
    • Text: The actual textual content of the tweet.
    • User: The username of the individual who posted the tweet.
    • Created At: The date and time when the tweet was originally published.
    • Likes: The total number of likes or approvals the tweet received.
    • Retweets: The total count of times the tweet was shared by other users.
    • Sentiment: The categorised emotional tone of the tweet, typically labelled as positive, neutral, or negative.

    Distribution

    The dataset is provided in a CSV file format. It consists of 1000 individual tweet records, structured in a tabular layout with the columns detailed above. A sample file will be made available separately on the platform.

    Usage

    This dataset is ideal for: * Analysing and visualising sentiment trends and patterns in social media. * Initial data exploration to uncover insights into tweet characteristics and user emotions. * Identifying underlying patterns or trends within social media conversations. * Developing and training machine learning models for sentiment classification. * Academic research into Natural Language Processing (NLP) and social media dynamics. * Educational purposes, allowing students to practise data analysis and visualisation techniques.

    Coverage

    The dataset spans tweets created between January and April 2023, as observed from the included data samples. While specific geographic or demographic information for users is not available within the dataset, the nature of Twitter implies a general global scope, reflecting a variety of user behaviours and sentiments without specific regional or population group focus.

    License

    CC0

    Who Can Use It

    This dataset is valuable for: * Data Scientists and Machine Learning Engineers working on NLP tasks and model development. * Researchers in fields such as Natural Language Processing, Machine Learning Algorithms, Deep Learning, and Computer Science. * Data Analysts looking to extract insights from social media content. * Academics and Students undertaking projects related to sentiment analysis or social media studies. * Anyone interested in understanding online sentiment and user behaviour on social media platforms.

    Dataset Name Suggestions

    • Twitter Public Sentiment Dataset
    • Social Media Text Sentiment Analysis
    • General Tweet Mood Data
    • Twitter Sentiment Collection 2023
    • Microblog Sentiment Dataset

    Attributes

    Original Data Source: Twitter Sentiment Analysis using Roberta and Vader

  8. US Means of Transportation to Work Census Data

    • kaggle.com
    Updated Feb 23, 2022
    Cite
    Sagar G (2022). US Means of Transportation to Work Census Data [Dataset]. https://www.kaggle.com/goswamisagard/american-census-survey-b08301-cleaned-csv-data/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sagar G
    Area covered
    United States
    Description

    The US Census Bureau conducts the American Community Survey (ACS) 1-year and 5-year surveys, which record various demographics and are publicly accessible through APIs. I have attempted to call the APIs from a Python environment using the requests library, then clean and organize the data into a usable format.

    Data Ingestion and Cleaning:

    ACS Subject data [2011-2019] was accessed using Python by following the API link below: https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:* The data was obtained in JSON format by calling the above API and then imported as a pandas DataFrame. The 84 variables returned comprise 21 estimate values for various metrics, 21 corresponding margin-of-error values, and the respective annotation values for both the estimates and the margins of error. The data then underwent various cleaning steps in Python, in which excess variables were removed and the column names were renamed. Web scraping was carried out to extract the variables' names and replace the codes in the column names of the raw data.
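
    As a minimal sketch of the API call and DataFrame construction described above: the Census API returns a JSON array whose first row is the header. The request parameters follow the link above; error handling and the API key are omitted here.

    # Sketch of pulling one year of ACS-1 B08301 data and loading it into pandas.
    # The URL follows the pattern given above; no API key is used in this sketch.
    import requests
    import pandas as pd

    url = "https://api.census.gov/data/2011/acs/acs1"
    params = {"get": "group(B08301)", "for": "county:*"}

    data = requests.get(url, params=params, timeout=60).json()

    # First row is the header, remaining rows are county records
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df.shape)   # columns include the B08301 group variables plus geography codes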

    The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, which were then merged into a single pandas DataFrame. The columns were rearranged, and the "NAME" column was split into two columns, 'StateName' and 'CountyName.' Counties for which no data was available were removed from the DataFrame. Once the DataFrame was ready, it was separated into two new DataFrames for state and county data and exported in '.csv' format.

    Data Source:

    More information about the source of Data can be found at the URL below: US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov https://www.census.gov/data/developers/about.html

    Final Word:

    I hope this data helps you to create something beautiful, and awesome. I will be posting a lot more databases shortly, if I get more time from assignments, submissions, and Semester Projects 🧙🏼‍♂️. Good Luck.

  9. Geographic Diversity in Public Code Contributions — Replication Package

    • data.niaid.nih.gov
    Updated Mar 31, 2022
    Cite
    Davide Rossi (2022). Geographic Diversity in Public Code Contributions — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6390354
    Explore at:
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Davide Rossi
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geographic Diversity in Public Code Contributions - Replication Package

    This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

    This document comes with the software needed to mine and analyze the data presented in the paper.

    Prerequisites

    These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

    click==8.0.4
    cycler==0.11.0
    fonttools==4.31.2
    kiwisolver==1.4.0
    matplotlib==3.5.1
    numpy==1.22.3
    packaging==21.3
    pandas==1.4.1
    patsy==0.5.2
    Pillow==9.0.1
    pyparsing==3.0.7
    python-dateutil==2.8.2
    pytz==2022.1
    scipy==1.8.0
    six==1.16.0
    statsmodels==0.13.2

    Initial data

    swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

    names.tab - forenames and surnames per country with their frequency

    zones.acc.tab - countries/territories, timezones, population and world zones

    c_c.tab - ccTLD entities - world zones matches

    Data preparation

    Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst

    sh> ./export.sh

    Run the authors cleanup script to create authors--clean.csv.zst

    sh> ./cleanup.sh authors.csv.zst

    Filter out implausible names and create authors--plausible.csv.zst

    sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

    Zone detection by email

    Run the email detection script to create author-country-by-email.tab.zst

    sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst

    Database creation and initial data ingestion

    Create the PostgreSQL DB

    sh> createdb zones-commit

    Notice that from now on when prepending the psql> prompt we assume the execution of psql on the zones-commit database.

    Import data into PostgreSQL DB

    sh> ./import_data.sh

    Zone detection by name

    Extract commits data from the DB and create commits.tab, that is used as input for the zone detection script

    sh> psql -f extract_commits.sql zones-commit

    Run the world zone detection script to create commit_zones.tab.zst

    sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst

    Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

    Ingest zones assignment data into the DB

    psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

    Extraction and graphs

    Run the script to execute the queries that extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).

    sh> ./extract_data.sh

    Run the script to create the graphs from all the previously extracted tabfiles.

    sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf

  10. OpenAsp Dataset

    • paperswithcode.com
    Updated Dec 6, 2023
    Cite
    Shmuel Amar; Liat Schiff; Ori Ernst; Asi Shefer; Ori Shapira; Ido Dagan (2023). OpenAsp Dataset [Dataset]. https://paperswithcode.com/dataset/openasp
    Explore at:
    Dataset updated
    Dec 6, 2023
    Authors
    Shmuel Amar; Liat Schiff; Ori Ernst; Asi Shefer; Ori Shapira; Ido Dagan
    Description

    OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and Multi-News summarization datasets.

    Dataset Access To generate OpenAsp, you require access to the DUC dataset which OpenAsp is derived from.

    Steps:

    Grant access to the DUC dataset by following the NIST instructions here. You should receive two user-password pairs (for DUC01-02 and DUC06-07) and a file named fwdrequestingducdata.zip.

    Clone this repository by running the following command:

    git clone https://github.com/liatschiff/OpenAsp.git

    Optionally create a conda or virtualenv environment:

    conda create -n openasp 'python>3.10,<3.11'
    conda activate openasp

    Install the Python requirements; this currently requires Python 3.8-3.10 (later Python versions have issues with spacy):

    pip install -r requirements.txt

    Copy fwdrequestingducdata.zip into the OpenAsp repo directory.

    run the prepare script command:

    python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'

    Load the dataset using huggingface datasets:

    from glob import glob
    import os
    import gzip
    import shutil
    from datasets import load_dataset

    openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

    data_files = {
        os.path.basename(fname).split('.')[0]: fname
        for fname in glob(openasp_files)
    }

    # decompress the .jsonl.gz files next to the originals
    for ftype, fname in data_files.copy().items():
        with gzip.open(fname, 'rb') as gz_file:
            with open(fname[:-3], 'wb') as output_file:
                shutil.copyfileobj(gz_file, output_file)
        data_files[ftype] = fname[:-3]

    # load OpenAsp as huggingface's dataset
    openasp = load_dataset('json', data_files=data_files)

    # print the first sample from every split
    for split in ['train', 'valid', 'test']:
        sample = openasp[split][0]

        # print title, aspect_label, summary and documents for the sample
        title = sample['title']
        aspect_label = sample['aspect_label']
        summary = '\n'.join(sample['summary_text'])
        input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

        print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
        print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
        print(f'\naspect-based summary:\n {summary}')
        print('\ninput documents:\n')
        for i, doc_txt in enumerate(input_docs_text):
            print(f'---- doc #{i} ----')
            print(doc_txt[:256] + '...')
        print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n\n\n')

    Troubleshooting

    Dataset failed loading with load_dataset() - you may want to delete the huggingface datasets cache folder.
    401 Client Error: Unauthorized - your DUC credentials are incorrect; please verify them (case sensitive, no extra spaces, etc.).
    Dataset created but prints a warning about content verification - you may be using a different version of NLTK or of the spacy model, which affects the sentence tokenization process. You must use the exact versions pinned in requirements.txt.
    IndexError: list index out of range - similar to the previous item; try to reinstall the requirements with the exact package versions.

    Under The Hood The prepare_openasp_dataset.py script downloads DUC and Multi-News source files, uses sacrerouge package to prepare the datasets and uses the openasp_v1_dataset_metadata.json file to extract the relevant aspect summaries and compile the final OpenAsp dataset.

    License: This repository, including openasp_v1_dataset_metadata.json and prepare_openasp_dataset.py, is released under the APACHE license.

    The OpenAsp dataset summary and source documents for each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses: the Multi-News license and the DUC license.

  11. Database of influencers' tweets in cryptocurrency (2021-2023)

    • cryptodata.center
    • data.mendeley.com
    Updated Dec 4, 2024
    + more versions
    Cite
    cryptodata.center (2024). Database of influencers' tweets in cryptocurrency (2021-2023) [Dataset]. https://cryptodata.center/dataset/https-data-mendeley-com-datasets-8fbdhh72gs-5
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The authors collected this database over eight months through the Twitter API. The data are tweets from over 50 experts regarding market analysis of 40 cryptocurrencies. These experts are known as influencers on social networks such as Twitter. The theory of behavioral economics shows that the opinions of people, especially experts, can impact the stock market trend (here, cryptocurrencies). Existing databases often cover tweets related to only one or a few cryptocurrencies. Also, in these databases, no attention is paid to the user's expertise, and most of the data is extracted using hashtags. Ignoring the user's expertise considerably increases the volume of irrelevant tweets and the share of neutral polarity.

    This database has a main table named "Tweets1" with 11 columns and 40 tables that separate the comments related to each cryptocurrency. The columns of the main table and the cryptocurrency tables are explained in the attached document. Researchers can use this dataset in various machine learning tasks, such as sentiment analysis and deep transfer learning with sentiment analysis. The data can also be used to examine the impact of influencers' opinions on the cryptocurrency market trend. Use of this database is allowed provided the source is mentioned. In this version, we have also added an Excel version of the database and Python code to extract the names of influencers and tweets.

    In version 3: three datasets related to historical prices and sentiments for Bitcoin, Ethereum, and Binance have been added as Excel files, covering January 1, 2023 to June 12, 2023. Two datasets of 52 influential tweets in cryptocurrencies have also been published, along with the score and polarity of sentiments regarding more than 300 cryptocurrencies from February 2021 to June 2023, as well as two Python scripts for the tweet sentiment analysis algorithm. This algorithm combines a RoBERTa pre-trained deep neural network and a BiGRU deep neural network with an attention layer (see the code Preprocessing_and_sentiment_analysis with python).
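
    The published code combines a pre-trained RoBERTa network with a BiGRU and an attention layer; the snippet below is only a much simpler illustration of scoring tweet sentiment with a pre-trained RoBERTa checkpoint via the Hugging Face transformers pipeline. The checkpoint name is an assumption and is not necessarily the model used by the authors.

    # Minimal illustration of RoBERTa-based tweet sentiment scoring with the
    # transformers pipeline; this is NOT the authors' RoBERTa+BiGRU+attention model,
    # and the model checkpoint named here is an assumption.
    from transformers import pipeline

    sentiment = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )

    tweets = [
        "Bitcoin looks ready for another leg up this week.",
        "Ethereum gas fees are painful right now.",
    ]

    for tweet, result in zip(tweets, sentiment(tweets)):
        # result is a dict with 'label' (negative/neutral/positive) and 'score'
        print(f"{result['label']:>8}  {result['score']:.3f}  {tweet}")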

  12. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    zip
    Updated Jun 6, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    zip
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023): https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
      echo "Downloading $year.zip..."
      wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
      unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
      rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`); in that case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files. As one example, the Python sketch below opens a single daily file with xarray.
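
    This sketch is not part of the data record; the year directory and date in the path are illustrative and follow the file-naming convention given in "Data details".

    # Example: open one daily file with xarray and read the soil moisture fields
    # described above. The date in the path is illustrative.
    import xarray as xr

    path = "2020/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200615000000-fv09.1r1.nc"
    ds = xr.open_dataset(path)

    sm = ds["sm"]                   # volumetric soil moisture (m3/m3)
    sm_unc = ds["sm_uncertainty"]   # random-error estimate
    gapmask = ds["gapmask"]         # 1 = satellite observation, 0 = gap-filled

    # Global mean of the gap-filled soil moisture field for this day
    print(float(sm.mean(skipna=True)))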

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from satellites community:

    • ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  13. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

    1. Framework overview. This paper proposed a pipeline to construct high-quality datasets for text mining in materials science. Firstly, we utilize a traceable automatic acquisition scheme for the literature to ensure the traceability of the textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

    2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470), and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.

    2.1 Data collection and preprocessing. Firstly, 55 materials science articles related to the NASICON system are collected through the Crystallographic Information File (CIF), which contains a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token.
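
    As a rough sketch of the preprocessing described above (PDF-to-text conversion with PDFMiner followed by regular-expression cleanup), the snippet below uses pdfminer.six; the file name and cleanup patterns are illustrative assumptions, not the authors' exact rules.

    # Hypothetical sketch of the preprocessing described above: convert a PDF to
    # plain text with pdfminer.six and strip simple artifacts with regular
    # expressions. The cleanup patterns are illustrative only.
    import re
    from pdfminer.high_level import extract_text

    text = extract_text("nasicon_paper.pdf")   # placeholder file name

    # Collapse hyphenation at line breaks and normalize whitespace
    text = re.sub(r"-\n", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)

    with open("nasicon_paper.txt", "w", encoding="utf-8") as f:
        f.write(text)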

  14. Data from: Dynamic Slicing of WebAssembly Binaries

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 17, 2023
    Cite
    Anonymous (2023). Dynamic Slicing of WebAssembly Binaries [Dataset]. http://doi.org/10.5281/zenodo.8025836
    Explore at:
    zip
    Dataset updated
    Jul 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package that accompanies the paper titled: "Dynamic Slicing of WebAssembly Binaries".

    # Slices dataset
    ## Generating the dataset
    The dynamic slices have been generated with [P-ORBS](https://syed-islam.github.io/research/program-analysis/#observation-based-program-slicing-orbs).
    The steps and scripts to generate the dynamic slices are included in the `slicing-steps/` directory.
    These steps also describe how to generate the `stats.csv` file that is included in this dataset.

    The static slices have been generated with [wassail](https://github.com/acieroid/wassail).
    The scripts to generate the static slices are included in the current directory (`generate-static-slices.sh` which relies on `run_wassail.sh`).
    These steps generate the file `static-stats.csv` that is included in this dataset.

    The `original-size.csv` file, included in this dataset, can be generated as follows:
    ```sh
    find subjects-wasm-extract-slice -name \*.c.wat -exec c_count {} \; | grep subjects > counts.txt
    echo 'slice,original fn slice' > evaluation/original-sizes.csv
    sed -E 's|^(.*) subjects-wasm-extract-slice/[^/]*/([^/]*)/.*$|\2,\1|' counts.txt >> evaluation/original-sizes.csv
    ```

    The numbers in Table 1 of the paper (the list of programs in the dataset along with their sizes) can be generated as follows.
    For the WebAssembly files, we can count the function size:

    ```
    find subjects-wasm-extract-slice -name \*.c.wat -exec c_count {} \; | grep subjects > counts.txt
    sed -E 's|^(.*) subjects-wasm-extract-slice/([^/]*)/.*$|\1 \2|' counts.txt | python evaluation/table1-wasm-mean.py
    ```

    or the full program size:

    ```
    find subjects-wasm-extract-slice -name t.wat -exec c_count {} \; | grep subjects > counts.txt
    sed -E 's|^(.*) subjects-wasm-extract-slice/([^/]*)/.*$|\1 \2|' counts.txt | python evaluation/table1-wasm-mean.py
    ```

    ## Structure of the dataset

    The dataset is structured as follows:

    - `subjects/` contains the instrumented `.c` source code, along with scripts to generate the dynamic slices.
    The original source code can be obtained by removing the line `printf(" ORBS:%x ....`.
    - `subjects-wasm-extract-slice/` contains the original WebAssembly programs to slice. Each program has two files: `t.wat` is the full binary file, and `name.c.wat` is the binary code of the function containing the slicing criterion.
    - `all_slices/` contains the slices. For example, program `adpcm_ah1_254_expr` has the following files
    - `adpcm/adpcm_ah1_254_expr/EWS_adpcm.wat`: the EWS slice
    - `adpcm/adpcm_ah1_254_expr/SEW_adpcm.wat`: the SEW slice
    - `adpcm/adpcm_ah1_254_expr/ESW_adpcm.wat`: the ESW slice
    - `adpcm/adpcm_ah1_254_expr/static_adpcm.wat.slice`: the SWS slice
    The other files are produced by intermediary steps and can be ignored. They are:
    - `adpcm/adpcm_ah1_254_expr/ESW_adpcm.wat.orig`: original (unsliced) binary *file* from which SW and ESW slices are computed
    - `adpcm/adpcm_ah1_254_expr/SEW_adpcm.wat.orig`: original (unsliced) binary *function* from which SEW slice is computed
    - `adpcm/adpcm_ah1_254_expr/SW_adpcm.wat`: slice of entire binary file from which ESW slice is extracted
    - `adpcm/adpcm_ah1_254_expr/WS_adpcm.wat`: compiled (binary) version of dynamic C slice from which EWS slice is extracted

    # Research questions
    ## RQ1

    The script `./RQ1.py` found in the `evaluation/` directory generates:

    - Figure 3 (time.pdf)
    - The mean, min, max, and stddev of the times
    - How many slices are computed below 10, 100, 1000, and 10000 seconds

    ## RQ2

    The script `./RQ2.py` found in the `evaluation/` directory generates:

    - Figure 4 (loc.pdf)
    - The mean, median, min, max, and stddev of the sizes
    - The largest differences between the approaches
    - The number of slices larger than the original program

    ## RQ3 and RQ4

    The process for these research questions is manual and requires comparing slices.
    It cannot be automated.
    We did make heavy use of `diff --side-by-side` in this analysis.

  15. Tour Recommendation Model

    • test.researchdata.tuwien.ac.at
    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    bin, png, text/markdown
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings (see the sketch at the end of this entry).

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  16. H

    Data and Inferences from a Collection of Pan Troglodytes Specimens from...

    • dataverse.harvard.edu
    Updated Mar 20, 2023
    Cite
    Woodger Faugas (2023). Data and Inferences from a Collection of Pan Troglodytes Specimens from Rwanda, the Republic of Zaire (present-day Democratic Republic of the Congo), and Beyond [Dataset]. http://doi.org/10.7910/DVN/MAIGQE
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 20, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Woodger Faugas
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Democratic Republic of the Congo, Rwanda, Zaire
    Description

    This dataset examines the collections data for Pan troglodytes (PT) specimens held at the Field Museum of Natural History (FMNH) in Chicago, Illinois. A review and analysis of these collections reveals patterns that can lead to new research questions and hypotheses and deepen scientific understanding of the species' distribution and biology. At its core, this data extraction and analysis emphasizes the importance of studying chimpanzees: because of their close genetic relationship to human beings, they allow insights into human health and disease. Python code for data extraction is also provided, describing the extraction methodology and data categorization. The work additionally highlights some of the barriers associated with studying museum collections, concluding that studying PT collections data from the Field Museum can provide valuable insights for ecology, primate biology, biogeography, taxonomy, anatomy, and physiology.

    The Field Museum of Natural History, often referred to as the Field Museum, is an internationally acclaimed natural history museum located in Chicago, Illinois, one of the most populous cities in the U.S. Named in honor of its primary benefactor Marshall Field, who was also a founding benefactor of the University of Chicago, the Museum is renowned for its scientific and educational programs and its vast assortment of scientific specimens and artifacts. Its zoological collections are regarded as among the world's most extensive, comprising millions of specimens preserved dry, in fluid, or frozen (e.g., to facilitate anatomical research and DNA analysis).

    A search of museum records was conducted to extract collections data on Pan troglodytes, i.e., common chimpanzees. The primary objective of this data extraction and its subsequent analysis was to provide a deeper understanding of the species' taxonomy, ecology, and evolution; to uncover patterns that can generate novel research questions and hypotheses regarding primates and human beings; and to document PT's distribution and range, shedding light on PT interactions with other species and the environment. Studying PT collections data from a renowned institution like the Field Museum therefore offers valuable insights into ecology, primate biology, biogeography, taxonomy, anatomy, and physiology, contributing incrementally to the advancement of scientific knowledge in these realms. As close genetic relatives of human beings, chimpanzees also offer important insights into human health and disease; studying the chimpanzee immune system, for instance, has led to new understanding of human immunology. Exploring PT collections data can additionally promote greater cultural understanding, since chimpanzees bear cultural significance in various human societies.

    Despite these benefits, some barriers to studying museum collections exist. For instance, the FMNH notes that its web database does not represent the museum's complete zoological holdings, and documentation for specimens may differ based on when and how they were collected, as well as how recently they were acquired. Furthermore, while steps are taken to ensure the accuracy of the information in the FMNH database, the FMNH acknowledges that some of the collections data may contain inaccuracies.

    Methodologically, a search was conducted on the FMNH database to extract all available data regarding this primate species. The search criteria focused solely on the species' name, resulting in 61 specimens with corresponding data. The extracted data, downloaded from the Museum in comma-separated values (CSV) format, were categorized by Occurrence ID, Institution Code, Collection Code, Catalog Number, Taxonomic Name, Phylum, Class, Order, Family, State/Province, County, Locality, DwC_Sex, Sex, Tissue Present, IRN, Prep Type, Date Collected, and PriLoan Status. Some columns from the original file, such as IRN, were duplicated and deleted, while other columns, such as Identified By, Type Status, Continent/Ocean, Preservation, Life Stage, Collection Methods, Available Data, Latitude, Longitude, Count, Birds/Mammals Parts, Measurement, and Measurement Value, contained empty cells. An alternative method for producing the CSV file using Python entails using the following lines of code,...
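
    The original code snippet is truncated above. The following is a hypothetical sketch (not the author's code) of loading such a CSV export with pandas and applying the cleaning steps described; the file name is assumed for illustration.

    import pandas as pd

    # Hypothetical file name for the FMNH CSV export described above.
    records = pd.read_csv("fmnh_pan_troglodytes.csv")

    # Drop duplicated columns (e.g. a repeated IRN column) and list the columns
    # that contain only empty cells.
    records = records.loc[:, ~records.columns.duplicated()]
    empty_columns = [col for col in records.columns if records[col].isna().all()]
    print(len(records), "specimen records; empty columns:", empty_columns)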

  17. c

    ckanext-ark

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-ark [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-ark
    Explore at:
    Dataset updated
    Jun 4, 2025
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The ARK extension for CKAN provides the capability to mint and resolve ARK (Archival Resource Key) identifiers for datasets. Inspired by the ckanext-doi extension, it facilitates persistent identification of data resources within a CKAN instance, ensuring greater stability and longevity of data references. This extension is compatible with CKAN versions 2.9 and 2.10 and supports Python versions 3.8, 3.9, and 3.10.

    Key Features:

    • ARK Identifier Minting: Generates unique ARK identifiers for CKAN datasets, helping to guarantee long-term access and citation stability.

    • ARK Identifier Resolving: Enables resolution of ARK identifiers to the associated CKAN resources.

    • Configurable ARK Generation: Allows customization of ARK generation through configurable templates and shoulder assignments, providing flexibility in identifier structure.

    • ERC Metadata Mapping: Supports mapping of dataset fields to ERC (Encoding and Rendering Conventions) metadata elements, using configurable mappings to extract data for persistent identification.

    • Command-Line Interface (CLI) Tools: Includes CLI commands to manage (create, update, and delete) ARK identifiers for existing datasets.

    • Customizable NMA (Name Mapping Authority) URL: Supports setting a custom NMA URL, allowing the resolver to point to data sources other than ckan.siteurl.

    Technical Integration: The ARK extension integrates with CKAN through plugins and modifies the read_base.html template (either in a custom extension or directly) to display the ARK identifier within the dataset's view. Configuration settings, such as the ARK NAAN (Name Assigning Authority Number) and other specific parameters, are defined in the CKAN configuration file (ckan.ini). Database initialization is required after installation to ensure proper functioning.

    Benefits & Impact: Implementing the ARK extension enhances the data management capabilities of CKAN by providing persistent identifiers for datasets. This ensures that data resources can be reliably cited and accessed over time. By enabling the association of ERC metadata with ARKs, the extension promotes better description and discoverability of data. The extension can be beneficial for institutions that require long-term preservation and persistent identification of their data resources.

  18. GreenDB: A Product-by-Product Sustainability Database

    • zenodo.org
    • explore.openaire.eu
    Updated Oct 2, 2023
    Cite
    Sebastian Jäger; Alexander Flick; Jessica Adriana Sanchez Garcia; Kaspar von den Driesch; Felix Bießmann; Ivana Trajanovska (2023). GreenDB: A Product-by-Product Sustainability Database [Dataset]. http://doi.org/10.5281/zenodo.8206328
    Explore at:
    Dataset updated
    Oct 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sebastian Jäger; Alexander Flick; Jessica Adriana Sanchez Garcia; Kaspar von den Driesch; Felix Bießmann; Ivana Trajanovska
    Description

    The publicly available, open GreenDB is a product-by-product sustainability database. It contains, on the one hand, product attributes such as name, description, and color and, on the other hand, information about the products' sustainability. The sustainability information is evaluated transparently, so products can be ranked by their sustainability.

    Landing Page: https://green-db.calgo-lab.de
    Source Code: https://github.com/calgo-lab/green-db

    Further Notes

    For usability, we export the data in two formats:

    1. CSV
    2. Parquet (using pyarrow python package)

    Be careful with the columns' data types when using the CSV files! Use for example:

    from ast import literal_eval
    from datetime import datetime
    
    import pandas as pd
    
    # The converters parse ISO timestamps and list-valued columns (stored as
    # Python literals in the CSV) instead of leaving them as plain strings.
    products = pd.read_csv("products.csv", converters={"timestamp": datetime.fromisoformat, "image_urls": literal_eval, "sustainability_labels": literal_eval}).convert_dtypes()
    sustainability_labels = pd.read_csv("sustainability_labels.csv", converters={"timestamp": datetime.fromisoformat}).convert_dtypes()
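
    For the Parquet files, pandas (with pyarrow installed) can read the data directly and no converters are needed; the file name below is an assumption:

    # Hypothetical file name; the Parquet export preserves column types.
    products = pd.read_parquet("products.parquet")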

  19. Z

    Data from: A dataset of GitHub Actions workflow histories

    • data.niaid.nih.gov
    Updated Oct 25, 2024
    + more versions
    Cite
    Cardoen, Guillaume (2024). A dataset of GitHub Actions workflow histories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10259013
    Explore at:
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    Cardoen, Guillaume
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

    Important notice: Zenodo appears to compress gzipped files a second time without notice, so they arrive "double compressed"; downloaded files may therefore be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.

    2024-10-25 update: updated the repository list and observation period. The date-based filters were also updated.

    2024-07-09 update: fixed an occasionally invalid valid_yaml flag.

    The dataset was created as follows:

    First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits, at least 100 stars, and at least one commit made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)

    We checked whether a .github/workflows folder existed, filtered out the repositories that did not contain it, and pulled the others (between the 9th and 10th of October 2024).

    We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)

    We concatenated every file in /ourDataFolder/output into a single CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.

    We added the column uid via a script available on GitHub.

    Finally, we archived the folder with pigz /ourDataFolder/workflows (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows)

    Using the extracted data, the following files were created:

    workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.

    workflows_auxiliaries.tar.gz is a similar file containing also auxiliary files.

    workflows.csv.gz contains the metadata for the extracted workflow files.

    workflows_auxiliaries.csv.gz is a similar file containing also metadata for auxiliary files.

    repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

    The metadata is separated into different columns:

    repository: The repository (author and repository name) from which the workflow was extracted. The "/" separator distinguishes the author from the repository name.

    commit_hash: The commit hash returned by git

    author_name: The name of the author that changed this file

    author_email: The email of the author that changed this file

    committer_name: The name of the committer

    committer_email: The email of the committer

    committed_date: The committed date of the commit

    authored_date: The authored date of the commit

    file_path: The path to this file in the repository

    previous_file_path: The path to this file before it has been touched

    file_hash: The name of the related workflow file in the dataset

    previous_file_hash: The name of the related workflow file in the dataset, before it has been touched

    git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.

    valid_yaml: A boolean indicating if the file is a valid YAML file.

    probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)

    valid_workflow: A boolean indicating whether the file respects the GitHub Actions workflow syntax. A freely available JSON Schema (used by gigawork) was used for this purpose.

    uid: Unique identifier for a given file surviving modifications and renames. It is generated on the addition of the file and stays the same until the file is deleted. Renaming does not change the identifier.

    Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
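
    As a usage illustration, here is a minimal sketch that relies only on the column names listed above; it loads the metadata and counts valid workflow files per repository.

    import pandas as pd

    # Load the workflow metadata; pandas infers gzip compression from the extension.
    meta = pd.read_csv("workflows.csv.gz")

    # Keep only syntactically valid workflow files (assuming valid_workflow is
    # parsed as a boolean) and count distinct files per repository.
    valid = meta[meta["valid_workflow"] == True]
    per_repo = valid.groupby("repository")["file_hash"].nunique().sort_values(ascending=False)
    print(per_repo.head(10))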

  20. Z

    Data from: CESNET-QUIC22: A large one-month QUIC network traffic dataset...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Feb 29, 2024
    Cite
    Hynek, Karel (2024). CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7409923
    Explore at:
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Hynek, Karel
    Luxemburk, Jan
    Šiška, Pavel
    Lukačovič, Andrej
    Čejka, Tomáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888. We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

    The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

    Data capture

    The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides the per-week flow count, capture period, and uncompressed size:

    W-2022-44: Uncompressed Size: 19 GB; Capture Period: 31.10.2022 - 6.11.2022; Number of flows: 32.6M

    W-2022-45: Uncompressed Size: 25 GB; Capture Period: 7.11.2022 - 13.11.2022; Number of flows: 42.6M

    W-2022-46: Uncompressed Size: 20 GB; Capture Period: 14.11.2022 - 20.11.2022; Number of flows: 33.7M

    W-2022-47: Uncompressed Size: 25 GB; Capture Period: 21.11.2022 - 27.11.2022; Number of flows: 44.1M

    CESNET-QUIC22 (total): Uncompressed Size: 89 GB; Capture Period: 31.10.2022 - 27.11.2022; Number of flows: 153M

    Data description

    The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.

    Packet Sequences

    Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1: +1 means a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

    Flow statistics

    Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of the flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information is available in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

    Dataset structure

    The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes the SNI domains used for ground truth labeling. The following list describes the flow data fields in the CSV files:

    ID: Unique identifier
    SRC_IP: Source IP address
    DST_IP: Destination IP address
    DST_ASN: Destination Autonomous System number
    SRC_PORT: Source port
    DST_PORT: Destination port
    PROTOCOL: Transport protocol
    QUIC_VERSION: QUIC protocol version
    QUIC_SNI: Server Name Indication domain
    QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
    TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
    TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
    DURATION: Duration of the flow in seconds
    BYTES: Number of transmitted bytes from client to server
    BYTES_REV: Number of transmitted bytes from server to client
    PACKETS: Number of packets transmitted from client to server
    PACKETS_REV: Number of packets transmitted from server to client
    PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
    PPI_LEN: Number of packets in the PPI sequence
    PPI_DURATION: Duration of the PPI sequence in seconds
    PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
    PHIST_SRC_SIZES: Histogram of packet sizes from client to server
    PHIST_DST_SIZES: Histogram of packet sizes from server to client
    PHIST_SRC_IPT: Histogram of inter-packet times from client to server
    PHIST_DST_IPT: Histogram of inter-packet times from server to client
    APP: Web service label
    CATEGORY: Service category
    FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
    FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
    FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
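
    As a usage illustration, a minimal sketch (the CSV file name is an assumption) for loading one weekly flow file and parsing the PPI column described above:

    from ast import literal_eval

    import pandas as pd

    # Load one weekly flow file; the PPI column stores
    # [[inter-packet times], [packet directions], [packet sizes]] as a string literal.
    flows = pd.read_csv("W-2022-44.csv.gz", converters={"PPI": literal_eval})

    ipt, directions, sizes = flows.loc[0, "PPI"]  # packet metadata of the first flow
    print(flows.loc[0, "QUIC_SNI"], len(sizes), "packets")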

    Link to other CESNET datasets

    https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
    https://github.com/CESNET/cesnet-datazoo

    Please cite the original data article:

    @article{CESNETQUIC22,
      author  = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
      title   = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
      journal = {Data in Brief},
      pages   = {108888},
      year    = {2023},
      issn    = {2352-3409},
      doi     = {https://doi.org/10.1016/j.dib.2023.108888},
      url     = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
    }

