76 datasets found
  1. Creating points from addresses in ArcGIS Online - Points of Interest CSV

    • resources-gisinschools-nz.hub.arcgis.com
    Updated Sep 4, 2017
    Cite
    GIS in Schools - Teaching Materials - New Zealand (2017). Creating points from addresses in ArcGIS Online - Points of Interest CSV [Dataset]. https://resources-gisinschools-nz.hub.arcgis.com/datasets/5d53d4b14ac64843af6d33f186c55ff5
    Explore at:
    Dataset updated
    Sep 4, 2017
    Dataset authored and provided by
    GIS in Schools - Teaching Materials - New Zealand
    Description

    Creating points from addresses in ArcGIS Online lesson. http://arcg.is/2vEljQx

  2. OpenCitations Index CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Jul 15, 2025
    Cite
    OpenCitations ​ (2025). OpenCitations Index CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.24356626.v6
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in the OpenCitations Index (https://opencitations.net/index), released on July 10, 2025. Each line of the CSV file defines a citation and includes the following fields:

    • "oci": the Open Citation Identifier (OCI) for the citation;
    • "citing": the OMID of the citing entity;
    • "cited": the OMID of the cited entity;
    • "creation": the creation date of the citation (i.e. the publication date of the citing entity);
    • "timespan": the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
    • "journal_sc": whether the citation is a journal self-citation (i.e. the citing and cited entities are published in the same journal);
    • "author_sc": whether the citation is an author self-citation (i.e. the citing and cited entities have at least one author in common).

    Note: the information for each citation is sourced from OpenCitations Meta (https://opencitations.net/meta), a database that stores and delivers bibliographic metadata for all bibliographic resources included in the OpenCitations Index. The data provided in this dump is therefore based on the state of OpenCitations Meta at the time this collection was generated.

    This version of the dataset contains 2,216,426,689 citations. The zipped archive is 38.8 GB; the unzipped CSV file is 242 GB.
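Given the unzipped file's size (242 GB), it is best processed as a stream rather than loaded whole. A minimal sketch with Python's csv module, using an invented two-row sample that mimics the documented column layout (the OCI/OMID values are placeholders, not real identifiers):

```python
import csv
import io

# Invented two-row sample mimicking the documented column layout;
# OCI/OMID values are placeholders. The real dump is a single ~242 GB
# CSV, so iterate it row by row rather than reading it into memory.
sample = io.StringIO(
    "oci,citing,cited,creation,timespan,journal_sc,author_sc\n"
    "0612058-0612059,omid:br/0612058,omid:br/0612059,2020-03,P1Y,no,no\n"
    "0612060-0612061,omid:br/0612060,omid:br/0612061,2019,P2Y6M,no,yes\n"
)

# Count self-citations (journal or author) while streaming
self_citations = 0
for row in csv.DictReader(sample):
    if row["journal_sc"] == "yes" or row["author_sc"] == "yes":
        self_citations += 1

print(self_citations)  # → 1 (one author self-citation in the sample)
```

The same loop works on the real file by replacing the StringIO object with `open("index.csv", newline="")`.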

  3. Residential School Locations Dataset (CSV Format)

    • borealisdata.ca
    • search.dataone.org
    Updated Jun 5, 2019
    Cite
    Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Borealis
    Authors
    Rosa Orlandini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Area covered
    Canada
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential Schools Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada, produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed to RSIM, Morgan Hite, NCTR, or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered a new location, as is the case of Girouard Indian Residential School.

    When the precise location is known, the coordinates of the main building are provided; when the precise location of the building isn't known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.

  4. Co-Creation Database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 5, 2023
    Cite
    Loisel, Quentin (2023). Co-Creation Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6773027
    Explore at:
    Dataset updated
    Aug 5, 2023
    Dataset provided by
    Loisel, Quentin
    Agnello, Danielle
    Chastin, Sebastien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Co-Creation Database groups scientific references on co-creation.

    It mainly contains the title, abstract, DOI, and authors.

    Two versions are available:

    Version 1.5 includes 13,501 references, from PubMed, ProQuest and CINAHL, from January 1970 to November 2021. Available in RIS (Research Information Systems) format and CSV (CSV UTF-8). Quality metrics: 9.38% false negatives; 20.35% false positives.

    Version 2.0 is an update based on a classification model trained with version 1.5. It includes references from Scopus and Web of Science from January 1970 to March 2023, with an update of the databases used for version 1.5 covering December 2021 to March 2023. Two CSV (CSV UTF-8) files are available. The file "Co-Creation Database v2.0 - full.csv" combines the previous version (1.5) and the update, with 52,821 references. The file "Co-Creation Database v2.0 - adding.csv" contains only the update, with 39,219 references. Quality metrics: 13.98% false negatives; 36.43% false positives.

    To perform your search: we recommend extending your search to the title and abstract, since some data are initially missing. The RIS file can be uploaded to any reference manager (e.g., Zotero, Mendeley), where you will have a search feature. For example, here is the link for advanced-search instructions in Zotero: https://www.zotero.org/support/searching. Additionally, you can run a Boolean search over the CSV files using a Python script.
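As a sketch of such a Boolean search over title and abstract (the "Title"/"Abstract" column names and the sample rows are assumptions; check the header of the downloaded CSV):

```python
import csv
import io

# Invented two-row sample; the real input would be one of the
# Co-Creation Database CSV files, whose column names may differ.
sample = io.StringIO(
    "Title,Abstract\n"
    "Co-creation in public health,Participatory methods for health promotion\n"
    "Machine learning basics,An introduction to supervised learning\n"
)

def matches(row, all_terms=(), any_terms=()):
    """AND over all_terms, OR over any_terms, case-insensitive."""
    text = (row.get("Title", "") + " " + row.get("Abstract", "")).lower()
    return all(t in text for t in all_terms) and (
        not any_terms or any(t in text for t in any_terms)
    )

hits = [r["Title"] for r in csv.DictReader(sample)
        if matches(r, all_terms=("health",), any_terms=("co-creation", "participatory"))]
print(hits)  # → ['Co-creation in public health']
```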

    To improve the database in further updates: we make available an online form to submit any irrelevant references you may find or to submit any relevant reference not inside the last version. The form is available at the following link: https://forms.office.com/e/6vu9X0kBcw

    It was produced as part of Health CASCADE, a Marie Skłodowska-Curie Innovative Training Network funded by the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement n° 956501.

    The work is made available under the terms of the license CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

  5. Furniture E-commerce Dataset – 140K+ Product Records with Categories & Breadcrumbs (CSV for AI & NLP)

    • crawlfeeds.com
    csv, zip
    Updated Aug 20, 2025
    Cite
    Crawl Feeds (2025). Furniture E-commerce Dataset – 140K+ Product Records with Categories & Breadcrumbs (CSV for AI & NLP) [Dataset]. https://crawlfeeds.com/datasets/furniture-e-commerce-dataset-140k-product-records-with-categories-breadcrumbs-csv-for-ai-nlp
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    This furniture e-commerce dataset includes 140,000+ structured product records collected from online retail sources. Each entry provides detailed product information, categories, and breadcrumb hierarchies, making it ideal for AI, machine learning, and analytics applications.

    Key Features:

    • 📊 140K+ furniture product records in structured format

    • 🏷 Includes categories, subcategories, and breadcrumbs for taxonomy mapping

    • 📂 Delivered as a clean CSV file for easy integration

    • 🔎 Perfect dataset for AI, NLP, and machine learning model training

    Best Use Cases:
    LLM training & fine-tuning with domain-specific data
    Product classification datasets for AI models
    Recommendation engines & personalization in e-commerce
    Market research & furniture retail analytics
    Search optimization & taxonomy enrichment

    Why this dataset?

    • Large volume (140K+ furniture records) for robust training

    • Real-world e-commerce product data

    • Ready-to-use CSV, saving preprocessing time

    • Affordable licensing with bulk discounts for enterprise buyers

    Note:
    Each record in this dataset includes both a url (main product page) and a buy_url (the actual purchase page).
    The dataset is structured so that records are based on the buy_url, ensuring you get unique, actionable product-level data instead of just generic landing pages.

  6. WECC ADS 2034 Hydropower Generation Datasets

    • zenodo.org
    csv, pdf
    Updated May 14, 2025
    Cite
    Nathalie Voisin; Daniel Broman; Kerry Abernethy-Cannella; Cameron Bracken; Youngjun Son; Kevin Harris (2025). WECC ADS 2034 Hydropower Generation Datasets [Dataset]. http://doi.org/10.5281/zenodo.12617457
    Explore at:
    Available download formats: csv, pdf
    Dataset updated
    May 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nathalie Voisin; Daniel Broman; Kerry Abernethy-Cannella; Cameron Bracken; Youngjun Son; Kevin Harris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Every two years the WECC (Western Electricity Coordinating Council) releases an Anchor Data Set (ADS) to be analyzed with Production Cost Models (PCM); it represents the expected loads, resources, and transmission topology 10 years in the future from a given reference year. For hydropower resources, the WECC relies on members to provide data to parameterize the hydropower representation in production cost models. The datasets consist of plant-level hydropower generation, flexibility, ramping, and mode of operations, and are tied to the hydropower representation in those production cost models.

    In 2022, PNNL supported the WECC by developing the WECC ADS 2032 hydropower dataset [1]. The WECC ADS 2032 hydropower dataset (generation and flexibility) included an update of the climate year conditions (2018 calendar year), consistency in representation across the entire US WECC footprint, updated hydropower operations over the core Columbia River, and a higher temporal resolution (weekly instead of monthly)[1] associated with a GridView software update (weekly hydro logic). Proprietary WECC utility hydropower data were used when available to develop the monthly and weekly datasets and were completed with HydroWIRES B1 methods to develop the Hydro 923 plus (now RectifHydPlus weekly hydropower dataset) [2] and the flexibility parameterization [3]. The team worked with Bonneville Power Administration to develop hydropower datasets over the core Columbia River representative of the post-2018 change in environmental regulation (flex spill). Ramping data are considered proprietary, were leveraged from WECC ADS 2030, and were not provided in the release, nor are the WECC-member hydropower data.

    This release represents the WECC ADS 2034 hydropower dataset. The generator database was first updated by WECC. Based on a review of hourly generation profiles, 16 facilities were transitioned from fixed schedule to dispatchable (380.5MW). The operations of the core Columbia River were updated based on Bonneville Power Administration's long-term hydro-modeling using 2020-level of modified flows and using fiscal year 2031 expected operations. The update was necessary to reflect the new environmental regulation (EIS2023). The team also included a newly developed extension over Canada [4] that improves upon existing data and synchronizes the US and Canadian data to the same 2018 weather year. Canadian facilities over the Peace River were not updated due to a lack of available flow data. The team was able to modernize and improve the overall data processing using modern tools as well as provide thorough documentation and reproducible workflows [5,6]. The datasets have been incorporated into the 2034 ADS and are in active use by WECC and the community.

    WECC ADS 2034 hydropower datasets contain generation at weekly and monthly timesteps, for US hydropower plants, monthly generation for Canadian hydropower plants, and the two merged together. Separate datasets are included for generation by hydropower plant and generation by individual generator units. Only processed data are provided. Original WECC-utility hourly data are under a non-disclosure agreement and for the sole use of developing this dataset.

    [1] Voisin, N., Harris, K. M., Oikonomou, K., Turner, S., Johnson, A., Wallace, S., Racht, P., et al. (2022). WECC ADS 2032 Hydropower Dataset (PNNL-SA-172734). See presentation (Voisin N., K.M. Harris, K. Oikonomou, and S. Turner. 04/05/2022. "WECC 2032 Anchor Dataset - Hydropower." Presented by N. Voisin, K. Oikonomou at WECC Production Cost Model Dataset Subcommittee Meeting, Online, Utah. PNNL-SA-171897.).

    [2] Turner, S. W. D., Voisin, N., Oikonomou, K., & Bracken, C. (2023). Hydro 923: Monthly and Weekly Hydropower Constraints Based on Disaggregated EIA-923 Data (v1.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8212727

    [3] Stark, G., Barrows, C., Dalvi, S., Guo, N., Michelettey, P., Trina, E., Watson, A., Voisin, N., Turner, S., Oikonomou, K. and Colotelo, A. 2023 Improving the Representation of Hydropower in Production Cost Models, NREL/TP-5700-86377, United States. https://www.osti.gov/biblio/1993943

    [4] Son, Y., Bracken, C., Broman, D., & Voisin, N. (2025). Monthly Hydropower Generation Dataset for Western Canada (1.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14984725

    [5] https://github.com/HydroWIRES-PNNL/weccadshydro/

    [6] Voisin, N., Broman, D., Abernethy-Cannella, K., Bracken, C., Son, Y., & Harris, K. (2025). WECC ADS 2034 Hydropower Generation Code (weccadshydro). Zenodo. https://doi.org/10.5281/zenodo.15417594

    Dataset Files:

    File | Description | Timestep | Spatial Extent
    US_Monthly_Plant.csv | Generation data for US plants at a monthly timestep | Monthly | US
    US_Weekly_Plant.csv | Generation data for US plants at a weekly timestep | Weekly | US
    US_Monthly_Unit.csv | Generation data for US plants by generator units at a monthly timestep | Monthly | US
    US_Weekly_Unit.csv | Generation data for US plants by generator units at a weekly timestep | Weekly | US
    Canada_Monthly_Plant.csv | Generation data for Canadian plants at a monthly timestep | Monthly | Canada
    Canada_Monthly_Unit.csv | Generation data for Canadian plants by generator units at a monthly timestep | Monthly | Canada
    Merged_Monthly_Plant.csv | Generation data for US and Canadian plants at a monthly timestep | Monthly | US and Canada
    Merged_Monthly_Unit.csv | Generation data for US and Canadian plants by generator units at a monthly timestep | Monthly | US and Canada
    | Overview presentation of the WECC ADS 2034 dataset | N/A | N/A
    PNNL-SA-171897.pdf | Overview presentation of the WECC ADS 2032 dataset | N/A | N/A

    Data Description:

    Each dataset contains the following column headers:

    Column Name | Unit | Description
    Source | N/A | Indicates the method used to develop the data (see below)
    Generator Name | N/A | Generator name used in WECC PCM (in unit datasets)
    EIA ID | N/A | Energy Information Administration (EIA) plant ID (in plant datasets)
    DataTypeName | N/A | Data type (see below)
    DatatypeID | N/A | Data type ID
    Year | year | Year (not used)
    Week1 [Month1] | MWh | Generation MWh value for the data type; subsequent week or month columns contain data for each week or month in the dataset period
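Because each week (or month) is a separate column, analysis typically starts by reshaping this wide layout into long (plant, period, MWh) records. A minimal sketch with the standard library, using invented sample values and only a subset of the columns listed above:

```python
import csv
import io

# Invented one-plant sample following the wide column layout documented
# above (only a subset of columns, and only two week columns shown).
sample = io.StringIO(
    "Source,EIA ID,DataTypeName,DatatypeID,Year,Week1,Week2\n"
    "PNNL,1234,Energy,0,2018,100.5,98.2\n"
)

# Reshape each row into long (plant, week, MWh) tuples
long_rows = []
for row in csv.DictReader(sample):
    for column, value in row.items():
        if column.startswith("Week"):
            long_rows.append((row["EIA ID"], column, float(value)))

print(long_rows)  # → [('1234', 'Week1', 100.5), ('1234', 'Week2', 98.2)]
```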

    Data Source (Method)

    The dataset contains data from four different data sources, developed using different methods:


    Source | Description
    PNNL | Weekly / monthly aggregation performed by PNNL using hourly observed facility-scale generation provided in 2022 by asset owners for year 2018
    BPA | BPA long-term hydro-modeling (HYDSIM) with 2020-level modified flows for water years 1989-2018, using FY 2031 expected operations (EIS2023). Jan-Sept comes from 2018 and Oct-Dec from year 2007. Weekly disaggregation performed by PNNL based on daily observed 2018 flow. Hourly flexibility was evaluated by PNNL using hourly observed facility-scale generation in years 2018, 2019 and 2021.
    CAISO | Weekly / monthly aggregation performed by CAISO using hourly observed facility-scale generation for 2018. Daily flexibility also directly provided by CAISO.
    Canada |
  7. WFS NW ALKIS Owner Data Simplified Schema CSV

    • gimi9.com
    Cite
    WFS NW ALKIS Owner Data Simplified Schema CSV [Dataset]. https://gimi9.com/dataset/eu_5d209663-ccbc-44d9-a413-9c35c1c15d67/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The property register is kept in electronic form in the Official Property Register Information System (ALKIS). This Web Feature Service enables the targeted download of geo-objects in ALKIS based on a search query (direct-access download service). The service provides only the following geo-objects, limited to their essential properties, in the format of a simplified data-exchange schema defined in the “AdV product specification ALKIS-WFS and output formats (Shape, CSV)” (see www.adv-online.de): plots of land, including owners. The service is designed for use in simple, practical GIS clients without complex functionality. The output format is CSV. If an attribute (e.g. owners) has multiple values for the feature type Landstueck, one record per value is output for the parcel, until all attributes are unique. The service includes personal data and is secured (in the LVN NRW); registration is required for access. Status of the data used: 01.04.2022.
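For reference, a WFS 2.0 GetFeature request for a direct-access download service like this is typically assembled as below. The endpoint is a placeholder (the real service is access-restricted and requires registration); only the Landstueck feature type is named in the description above:

```python
from urllib.parse import urlencode

# Placeholder endpoint -- the real ALKIS-WFS endpoint requires
# registration and is not given in this listing.
endpoint = "https://example.org/wfs"

# Standard WFS 2.0 GetFeature parameters; outputFormat follows the
# service description (CSV output), count is an illustrative page size.
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "Landstueck",
    "outputFormat": "csv",
    "count": 100,
}

url = endpoint + "?" + urlencode(params)
print(url)
```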

  8. Depression prediction dataset based on online medical consultation

    • scidb.cn
    Updated Mar 13, 2023
    Cite
    Nie Hui (2023). Depression prediction dataset based on online medical consultation [Dataset]. http://doi.org/10.57760/sciencedb.07706
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Nie Hui
    Description

    The relevant features of the LIWC psychological dictionary are extracted from the consultation text after preprocessing the depression consultation data collected from the online consultation platform.

    File name: DepressionLevelPrediction-LIWC-Processed.csv
    Creation time: 2022-12-20
    Function: explore the relationship between LIWC-based features and depression
    Data volume: 3859 records
    Data format: UTF-8

    Field description:
    • ID: consultation record code
    • Depression: degree of depression (3: severe; 2: moderate; 1: mild; 0: undiagnosed)
    • Age: age
    • Gender: gender (1: male; 0: female)
    • Region: region (temporarily unused)
    • Identity: identity (temporarily unused)
    • Socialize: sociality
    • Emotion: emotion
    • Cognition: cognition
    • Perception: perception
    • Physiology: physiology
    • Gains or losses

  9. Ly_Protopopova_Intake Data_Raw Data.csv

    • figshare.com
    txt
    Updated Dec 9, 2022
    Cite
    Lexis Ly (2022). Ly_Protopopova_Intake Data_Raw Data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.21706211.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 9, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lexis Ly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ly, L.H., Protopopova, A. (2022). A mixed-method analysis of the consistency of intake information reported by shelter staff upon owner surrender of dogs.

    This dataset is the raw file from an online experiment assessing the agreement in data input for surrender reason, breed, and colour across shelter staff when presented with four complex narratives of fictional owners surrendering dogs.

  10. OA Tide Data CSV

    • noaa.hub.arcgis.com
    Updated Aug 13, 2024
    Cite
    NOAA GeoPlatform (2024). OA Tide Data CSV [Dataset]. https://noaa.hub.arcgis.com/datasets/d7d75e0568154ef3b48632ed70a8fbe1
    Explore at:
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA GeoPlatform
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Description

    Data in the Classroom is an online curriculum to foster data literacy; this Ocean Acidification module is geared towards grades 8-12. Visit Data in the Classroom for more information. This application is the Ocean Acidification module. The module was developed to engage students in increasingly sophisticated modes of understanding and manipulating data. It was completed prior to the release of the Next Generation Science Standards (NGSS)* and has since been adapted to incorporate some of the innovations described in the NGSS. Each level of the module provides learning experiences that engage students in the three dimensions of the NGSS Framework while building towards competency in targeted performance expectations. Note: this document identifies the specific practice, core idea, and concept directly associated with a performance expectation (shown in parentheses in the tables) but also includes additional practices and concepts that can help students build toward a standard.

    *NGSS Lead States. 2013. Next Generation Science Standards: For States, By States. Washington, DC: The National Academies Press. Next Generation Science Standards is a registered trademark of Achieve. Neither Achieve nor the lead states and partners that developed the Next Generation Science Standards were involved in the production of, nor do they endorse, this product.

  11. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (repository creator)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets used in the studies that served as research material for the thesis, as well as the datasets used in its experimental part.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
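Since the negative and positive halves are contiguous, a shuffle should precede any train/test split. A sketch over a tiny invented sample in the documented two-column layout of data_rt.csv:

```python
import csv
import io
import random

# Tiny invented sample in the documented reviews/labels layout;
# the real file (data_rt.csv) has 10,662 rows, negatives first.
sample = io.StringIO(
    "reviews,labels\n"
    "a dull and lifeless film,0\n"
    "a smart and funny ride,1\n"
)

rows = list(csv.DictReader(sample))
random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(rows)  # break the negative-first / positive-last ordering

print(sorted(r["labels"] for r in rows))  # both classes still present: ['0', '1']
```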

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using NLTK.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
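A sketch of how a polarity score might map to the categorical division label; the thresholds here are assumptions, since the dataset's exact cut-offs aren't documented:

```python
# Assumed thresholding: TextBlob polarity is in [-1, 1], and a simple
# sign-based mapping is a common choice. The dataset's actual cut-offs
# may differ.
def division(polarity: float) -> str:
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print([division(p) for p in (0.8, -0.3, 0.0)])  # → ['positive', 'negative', 'neutral']
```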

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machine-learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw), and division (manually added: categorical label generated using the overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited, used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)

  12. Online Sales Dataset - Popular Marketplace Data

    • kaggle.com
    Updated May 25, 2024
    Cite
    ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyanshVerma27
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

    Columns:

    • Order ID: Unique identifier for each sales order.
    • Date: Date of the sales transaction.
    • Category: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
    • Product Name: Specific name or model of the product sold.
    • Quantity: Number of units of the product sold in the transaction.
    • Unit Price: Price of one unit of the product.
    • Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).
    • Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
    • Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).
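    Because Total Price is defined as Quantity * Unit Price, that relationship can be checked when loading the file. A minimal sketch using only the column names listed above (the sample rows are invented for illustration):

```python
import csv
import io

# Hypothetical rows in the documented column layout.
SAMPLE = """Order ID,Date,Category,Product Name,Quantity,Unit Price,Total Price,Region,Payment Method
10001,2024-01-15,Electronics,Wireless Mouse,2,25.50,51.00,North America,Credit Card
10002,2024-01-16,Books,Data Science Handbook,1,40.00,40.00,Europe,PayPal
"""

def totals_consistent(fh) -> bool:
    """Return True if every row satisfies Total Price == Quantity * Unit Price."""
    for row in csv.DictReader(fh):
        expected = int(row["Quantity"]) * float(row["Unit Price"])
        if abs(expected - float(row["Total Price"])) > 1e-9:
            return False
    return True

print(totals_consistent(io.StringIO(SAMPLE)))  # True
```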

    Insights:

    1. Analyze sales trends over time to identify seasonal patterns or growth opportunities.
    2. Explore the popularity of different product categories across regions.
    3. Investigate the impact of payment methods on sales volume or revenue.
    4. Identify top-selling products within each category to optimize inventory and marketing strategies.
    5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
  13. Replication Data for: Revisiting 'The Rise and Decline' in a Population of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill (2023). Replication Data for: Revisiting 'The Rise and Decline' in a Population of Peer Production Projects [Dataset]. http://doi.org/10.7910/DVN/SG3LP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    TeBlunthuis, Nathan; Aaron Shaw; Benjamin Mako Hill
    Description

    This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files, and depending on your computation resources you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

    The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse MediaWiki XML dumps and create TSV files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these TSVs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these TSV files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run; the exception is stage 4, fitting models and generating plots, which always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, through loading the datasets and running the analysis, to building the intermediate datasets.

    Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, and unpack the tar archive (on a Unix system: tar xf code.tar). Navigate to code/paper_source and install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a Unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf; otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

    Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive and can be extracted on a Unix system with 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

    Running the analysis: Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful for creating stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a Unix system: tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a Unix system you can then simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

    Generating datasets, building the intermediate files: The intermediate files are generated from all.edits.RDS; this process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar, and unpack code.tar and userroles_data.7z (on a Unix system: tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies as above, then run 01_build_datasets.R.

    Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
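    The stage-skipping behaviour described above (a stage runs only when its outputs are missing, except the final model-fitting stage, which always runs) can be sketched as follows; the stage names and file paths below are hypothetical, and the real logic lives in the scripts inside code.tar:

```python
import os

def run_pipeline(stages):
    """Run (name, outputs, func, always_run) stages in order, skipping any
    stage whose output files all exist already -- mirroring the behaviour
    described for the replication scripts, where only the final
    model-fitting stage runs unconditionally."""
    executed = []
    for name, outputs, func, always_run in stages:
        if always_run or not all(os.path.exists(p) for p in outputs):
            func()
            executed.append(name)
    return executed
```

    A stage whose outputs (for example, the intermediate RDS files) are already present is skipped, so only the final analysis re-runs.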

  14. plan4res - public dataset for case study 1 part MIM-1: time series used for...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 9, 2020
    Cite
    Dieter Most; Dieter Most (2020). plan4res - public dataset for case study 1 part MIM-1: time series used for multi-modal investment pathway modelling [Dataset]. http://doi.org/10.5281/zenodo.3885481
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 9, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dieter Most; Dieter Most
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public dataset used within the plan4res project for performing case study 1, "Multi-modal European energy concept for achieving COP21" - Multi-modal Investment Modelling (MIM), Part 1: time series for the reference year 2015.

    The related documentation is included in plan4res' deliverable D4.5 chapter 3.2 (see 10.5281/zenodo.3785010)

    The data set includes the following data:

    a) Characteristic annual load profiles for large industrial heat demand for the chemical, iron & steel, food & beverage, and pulp & paper industries for the reference year 2015:

    HOTMAPS_TD_OUT_D_CHEM_20200608T160653_20200422T120000Z_v01.csv
    HOTMAPS_TD_OUT_D_FOOD_20200608T160724_20200422T120000Z_v01.csv
    HOTMAPS_TD_OUT_D_IRON_20200608T160705_20200422T120000Z_v01.csv
    HOTMAPS_TD_OUT_D_PAPER_20200608T160715_20200422T120000Z_v01.csv

    b) Characteristic demand profiles for road-side car passenger transport and the availability of cars for charging while (home) parking for the reference year 2015:

    SIEMENS_TD_OUT_D_RoadCar_20200608T160627_20200401T120000Z_v01.csv
    SIEMENS_TD_CAP_CarPark_20200608T160637_20200401T120000Z_v01.csv

    c) Load profiles for the exogenous demand of electricity for the reference year 2015. The exogenous demand includes all electricity consumption not explicitly modelled within MIM modelling:

    HRE4_TD_OUT_ElectricityExo_20200608T160732_20200401T120000Z_v01.csv

    d) Regionally resolved demand profiles for (individual) space heating and space cooling for the reference year 2015:

    HRE4_TRD_CAP_Cool_2015_20200608T160051_20200401T120000Z_v01.csv
    HRE4_TRD_CAP_HeatInd_2015_20200608T155849_20200401T120000Z_v01.csv

    e) Regionally resolved generation profiles of electricity from photovoltaic, wind onshore, wind offshore, and hydro run-of-river, and of heat from solar thermal, for the reference year 2015:

    NINJA_TRD_CAP_PV_2015_20200608T160440_20191104T120000Z_v01.csv
    NINJA_TRD_CAP_WindOFF_2015_20200608T155422_20191104T120000Z_v01.csv
    NINJA_TRD_CAP_WindON_2015_20200608T155251_20191104T120000Z_v01.csv
    HRE4_TRD_CAP_HydroRoR_2015_20200608T155550_20200401T120000Z_v01.csv
    HRE4_TRD_CAP_SolarThermal_2015_20200608T155718_20200401T120000Z_v01.csv

    f) Regionally resolved generation profile of electricity from wind offshore, transformed to represent potential future capacity factors as stated by doi:10.2760/041705, based on the reference year 2015:

    SIEMENS_TRD_CAP_WindOFF_2040_20200608T155127_20200401T120000Z_v01.csv

    g) A list of geographical descriptions of the zone hierarchy data used in MIM for the EU33 region set:

    SIEMENS_ZoneHierarchy_MIM_EU33_20181231T120000Z_20200131T1200000Z_v001.csv

    Further info:

    Time series are based on historical data for the reference year 2015.

    Values are normalized over one reference year such that either the maximum equals 1 (CAP files) or the integral equals 1 (OUT files).

    All values are listed in arbitrary units.

    All country names are according to ISO 3166-1 alpha-2.
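    The CAP/OUT normalisation convention described above can be sketched for an hourly profile (a minimal illustration, not part of the dataset's own tooling):

```python
def normalize_cap(series):
    """Scale a profile so its maximum over the reference year equals 1 (CAP files)."""
    peak = max(series)
    return [v / peak for v in series]

def normalize_out(series):
    """Scale a profile so its integral (sum of hourly values) equals 1 (OUT files)."""
    total = sum(series)
    return [v / total for v in series]

profile = [0.0, 2.0, 4.0, 2.0]  # toy hourly values
print(max(normalize_cap(profile)))  # 1.0
print(sum(normalize_out(profile)))  # 1.0
```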

  15. CSV and JSON data describing the quantity and content of uploads to...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 8, 2023
    Cite
    Oliver Schiffmann (2023). CSV and JSON data describing the quantity and content of uploads to Thingiverse for 2015-2020 [Dataset]. http://doi.org/10.5061/dryad.gb5mkkwvr
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    University of Bristol
    Authors
    Oliver Schiffmann
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The COVID-19 pandemic profoundly affected various aspects of daily life, particularly the supply and demand of essential goods, resulting in critical shortages. This included personal protective equipment (PPE) for medical professionals and the general public. To address these shortages, online "maker communities" emerged, aiming to develop and locally manufacture critical products. While some organized efforts existed, the majority of initiatives originated from individuals and groups on platforms like Thingiverse. This paper presents a longitudinal analysis of Thingiverse, one of the largest maker community websites, to examine the pandemic's effects. Our findings reveal a surge in community output during the initial lockdown periods in the major contributing nations (primarily those in the Western Hemisphere), followed by a subsequent decline. Additionally, pandemic-related products dominated uploads and interactions throughout 2020. Based on these observations, we propose recommendations to expedite the community's ability to support local, national, and international responses to future disasters.

    Methods: Data were collected using the Thingiverse API; some have been processed into CSV to make them easier to handle.

  16. OA Level 5 Gulf of Maine CSV

    • noaa.hub.arcgis.com
    Updated Aug 13, 2024
    Cite
    NOAA GeoPlatform (2024). OA Level 5 Gulf of Maine CSV [Dataset]. https://noaa.hub.arcgis.com/datasets/7ad713049852406c9d9534625966fa68
    Explore at:
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA GeoPlatform
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Description

    Data in the Classroom is an online curriculum to foster data literacy; this application is its Ocean Acidification module, geared towards grades 8-12. Visit Data in the Classroom for more information. The module was developed to engage students in increasingly sophisticated modes of understanding and manipulating data. It was completed prior to the release of the Next Generation Science Standards (NGSS)* and has recently been adapted to incorporate some of the innovations described in the NGSS. Each level of the module provides learning experiences that engage students in the three dimensions of the NGSS Framework while building towards competency in targeted performance expectations. Note: this document identifies the specific practice, core idea, and concept directly associated with a performance expectation (shown in parentheses in the tables) but also includes additional practices and concepts that can help students build toward a standard.

    *NGSS Lead States. 2013. Next Generation Science Standards: For States, By States. Washington, DC: The National Academies Press. Next Generation Science Standards is a registered trademark of Achieve. Neither Achieve nor the lead states and partners that developed the Next Generation Science Standards were involved in the production of, and do not endorse, this product.

  17. heaven_dataset_v2

    • huggingface.co
    Cite
    SafeCircle, heaven_dataset_v2 [Dataset]. https://huggingface.co/datasets/safecircleai/heaven_dataset_v2
    Explore at:
    Dataset authored and provided by
    SafeCircle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Heaven Dataset (Refined)

      Dataset Overview
    

    The Heaven Dataset (Refined) is a collection of messages with classifications related to predatory behavior detection in online conversations. This dataset is designed to help train and evaluate AI models that can identify potentially harmful communication patterns directed at minors.

      Dataset Description

      General Information

    Dataset Name: Heaven Dataset (Refined)
    Version: 1.0
    File Format: CSV
    Creation Date: … See the full description on the dataset page: https://huggingface.co/datasets/safecircleai/heaven_dataset_v2.

  18. Input files for Dispa-SET for the JRC report "Power System Flexibility in a...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Apr 30, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). Input files for Dispa-SET for the JRC report "Power System Flexibility in a variable climate" [Dataset]. http://doi.org/10.5281/zenodo.3775569
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    Apr 30, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Input files for Dispa-SET for the JRC report "Power System Flexibility in a variable climate"

    Here you can find the input files needed to reproduce the results of the report:

    De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.
    

    The results in the report are generated with the Dispa-SET power system model, available and explained at www.dispaset.eu.

    A description of the data sources, with references, can be found in the report.

    How to use this dataset

    This dataset can be used as input data for the Dispa-SET model. We refer to the report and the official model documentation for information about the data and the model.

    Description of the dataset

    The file EnVarClim.yml is a template of the YAML configuration file used by Dispa-SET. To run a specific climate year, the XXXX placeholder present in some input files must be replaced with that year.
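    As a small illustration of the XXXX placeholder convention, substituting a concrete climate year into an affected file name could look like this (the template name below is hypothetical; which input files actually carry the placeholder is determined by the dataset itself):

```python
def resolve_climate_year(template: str, year: int) -> str:
    """Replace the XXXX placeholder in a Dispa-SET input file name
    with a concrete climate year."""
    return template.replace("XXXX", str(year))

# Hypothetical template name for illustration only.
print(resolve_climate_year("emh_and_cc_availability_XXXX.csv", 1995))
```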

    Availability factors

    In the folder AvailabilityFactors there are the availability factors (from 0 to 1) for the power plants and the renewable generation. There is a subfolder for each simulated zone, each containing a file per climate year, from emh_and_cc_availability_1990.csv to emh_and_cc_availability_2015.csv.

    Cross-border transmission

    In the folder DayAheadNTC there is the file merged_constant_NTC.csv containing the capacity (in MW).

    NOTE: due to an error in the pre-processing code, there are some additional lines for the Western Balkans countries ending with a 1 (e.g. GR -> MK1). Those lines are ignored by the model because they are not associated with any simulated zone.

    Cross-border historical flows

    The file CC_L_flows.csv in the folder Flows contains the hourly flows between the simulated zones and their neighbours (RU, TR, UA).

    Fuel prices

    The folder FuelPrices contains a set of files with the hourly prices for the fuels (biomass, coal, lignite, gas, oil) and for CO2 emissions. It is worth noting that, despite their hourly resolution, the time series are constant throughout the year.

    Hourly load

    In the folder Load_RealTime there are hourly load time-series for each zone, one per climate year. For the Western Balkans countries we use the same time-series for every climate year.

    Outage factors

    The files CC_L_outages.csv in the folder OutageFactors contain the outage factor (from 1, full outage, to 0) for the various generation units. Whenever a simulated zone is missing, the model assumes the absence of outages.

    Power plants data

    In the folder PowerPlants there is a file named CC_L_plants.mip.csv for each simulated zone. The CSV files contain the data needed by Dispa-SET.

    Water storage levels

    The folder ReservoirLevel contains the storage levels (values from 0 to 1, relative to the size of the storage) for all the simulated zones. The levels have been computed for each climate year, using a different inflow, with the mid-term scheduler recently implemented in Dispa-SET. For the Western Balkans countries we use the same time-series for every climate year.

    Hydro-power inflows

    The folder ScaledInflows contains the inflows used for hydro-power generation. The values in the CSV files describe how much energy is available for hydro-power generation relative to the installed capacity.


  19. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    Cite
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in a form of Jupyter notebooks.

    Options to access the dataset

    There are two ways how to get access to the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    In order to obtain an access to the dataset (either to full static dump or REST API), please, request the access by following instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please, cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
  author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blog posts from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The means of reporting significant mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears at the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) are anonymised.


    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. category of annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth, determined by human experts) and prediction (the annotation was created by an AI method).
    2. type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.
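    Because the value attribute is JSON whose structure varies by annotation type, a consumer typically decodes it per record. A minimal sketch with an invented claim-presence annotation (the real value structures depend on the annotation type; see annotation_types.csv):

```python
import json

def parse_annotation(row: dict) -> dict:
    """Decode the JSON-encoded `value` attribute of an annotation record.

    The record used below is invented for illustration; real value
    structures differ per annotation type.
    """
    parsed = dict(row)
    parsed["value"] = json.loads(row["value"])
    return parsed

record = {
    "annotation_category": "label",
    "annotation_type_id": "claim-presence",
    "method_id": "manual-annotation",
    "value": '{"claim_present": true}',
}
print(parse_annotation(record)["value"]["claim_present"])  # True
```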


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  20. Intersectional Lens on Leaders_Study_Rawdata.csv

    • psycharchives.org
    Updated Oct 6, 2022
    Cite
    (2022). Intersectional Lens on Leaders_Study_Rawdata.csv [Dataset]. https://www.psycharchives.org/handle/20.500.12034/7527
    Explore at:
    Dataset updated
    Oct 6, 2022
    License

    https://doi.org/10.23668/psycharchives.4988

    Description

    Younger men and especially younger women are excluded from leadership roles or obstructed from succeeding in these positions by facing backlash. Our project aims to build a more gender-specific understanding of the backlash that younger individuals in leadership positions face. We predict an interactive backlash for younger women and younger men that is rooted in intersectional stereotypes compared to the stereotypes based on single demographic categories (i.e., age or gender stereotypes). To test our hypotheses, we collect data from a heterogeneous sample (N = 900) of U.S. citizens between 25 and 69 years. We conduct an experimental online study with a between-participant design to examine the backlash against younger women and younger men.

    Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2022). An Intersectional Lens on Leadership: Prescriptive Stereotypes towards Younger Women and Younger Men and their Effect on Leadership Perception. PsychArchives. https://doi.org/10.23668/psycharchives.5404

    Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2023). An intersectional lens on young leaders: bias toward young women and young men in leadership positions. Frontiers in Psychology, 14. https://doi.org/10.3389/fpsyg.2023.120454

    Research has recognized age biases against young leaders, yet understanding of how gender, the most frequently studied demographic leader characteristic, influences this bias remains limited. In this study, we examine the gender-specific age bias toward young female and young male leaders through an intersectional lens. By integrating intersectionality theory with insights on status beliefs associated with age and gender, we test whether young female and male leaders face an interactive rather than an additive form of bias. We conducted two preregistered experimental studies (N1 = 918 and N2 = 985), where participants evaluated leaders based on age, gender, or a combination of both. Our analysis reveals a negative age bias in leader status ascriptions toward young leaders compared to middle-aged and older leaders. This bias persists when gender information is added, as demonstrated in both intersectional categories of young female and young male leaders. This bias pattern does not extend to middle-aged or older female and male leaders, thereby supporting the age bias against young leaders specifically. Interestingly, we also examined whether social dominance orientation strengthens the bias against young (male) leaders, but our results (reported in the SOM) are not as hypothesized. In sum, our results emphasize the importance of young age as a crucial demographic characteristic in leadership perceptions that can even overshadow the role of gender.

    Raw Data File

