Creating points from addresses in ArcGIS Online lesson. http://arcg.is/2vEljQx
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains all the citation data (in CSV format) included in the OpenCitations Index (https://opencitations.net/index), released on July 10, 2025. Each line of the CSV file defines a citation and includes the following fields:
- oci: the Open Citation Identifier (OCI) for the citation;
- citing: the OMID of the citing entity;
- cited: the OMID of the cited entity;
- creation: the creation date of the citation (i.e., the publication date of the citing entity);
- timespan: the time span of the citation (i.e., the interval between the publication date of the cited entity and the publication date of the citing entity);
- journal_sc: whether the citation is a journal self-citation (i.e., the citing and the cited entities are published in the same journal);
- author_sc: whether the citation is an author self-citation (i.e., the citing and the cited entities have at least one author in common).
Note: the information for each citation is sourced from OpenCitations Meta (https://opencitations.net/meta), a database that stores and delivers bibliographic metadata for all bibliographic resources included in the OpenCitations Index. The data provided in this dump is therefore based on the state of OpenCitations Meta at the time this collection was generated.
This version of the dataset contains 2,216,426,689 citations. The size of the zipped archive is 38.8 GB, while the size of the unzipped CSV file is 242 GB.
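Given its size, the dump is easiest to process as a stream rather than loading it whole. Below is a minimal Python sketch that counts journal self-citations; the local file name is hypothetical, and the journal_sc field is assumed to carry "yes"/"no" flags.

```python
import csv

# Minimal sketch: stream the (very large) unzipped dump line by line.
# The local file name is hypothetical; journal_sc is assumed to be "yes"/"no".
path = "ocindex_dump.csv"  # hypothetical local file name

total = 0
journal_self = 0
with open(path, newline="", encoding="utf-8") as f:
    # header: oci,citing,cited,creation,timespan,journal_sc,author_sc
    for row in csv.DictReader(f):
        total += 1
        if row["journal_sc"] == "yes":
            journal_self += 1

print(f"{journal_self:,} journal self-citations out of {total:,} citations")
```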
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential Schools Settlement Agreement (IRSSA) are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR, or Rosa Orlandini.
Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided; when the precise location of the building isn't known, an approximate location is provided.
For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Co-Creation Database groups scientific references on co-creation.
It mainly contains the title, abstract, DOI, and authors.
Two versions are available:
Version 1.5 includes 13,501 references, from PubMed, ProQuest and CINAHL, from January 1970 to November 2021. Available in RIS (Research Information Systems) format and CSV (CSV UTF-8). Quality metrics: 9.38% false negatives; 20.35% false positives.
Version 2.0 is an update produced with a classification model trained on version 1.5. It includes references from Scopus and Web of Science from January 1970 to March 2023, along with an update of the databases used for version 1.5 covering December 2021 to March 2023. Two CSV (CSV UTF-8) files are available. The file "Co-Creation Database v2.0 - full.csv" combines the last version (1.5) and the update, with 52,821 references. The file "Co-Creation Database v2.0 - adding.csv" contains only the update, with 39,219 references. Quality metrics: 13.98% false negatives; 36.43% false positives.
To perform your search: we recommend you extend your search to the title and abstract, since some data are initially missing. The RIS file can be uploaded to any reference manager (e.g., Zotero, Mendeley, etc.), where you will have a search feature. For example, here is the link for advanced search instructions in Zotero: https://www.zotero.org/support/searching. Additionally, you can run a Boolean search on the CSV files using a Python script, as sketched below.
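As one possible sketch of that Boolean-search approach: the following script streams the full CSV and prints matching rows. The "Title", "Abstract", and "DOI" column names are assumptions; adjust them to the actual header of the file.

```python
import csv

# Minimal sketch of a Boolean search over the CSV export. The "Title",
# "Abstract", and "DOI" column names are assumptions; adjust to the header.
def matches(text, all_terms=(), any_terms=(), not_terms=()):
    text = text.lower()
    return (all(t in text for t in all_terms)
            and (not any_terms or any(t in text for t in any_terms))
            and not any(t in text for t in not_terms))

with open("Co-Creation Database v2.0 - full.csv", newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        blob = f"{row.get('Title') or ''} {row.get('Abstract') or ''}"
        if matches(blob, all_terms=("co-creation",), any_terms=("health", "community")):
            print(row.get("DOI") or "", "|", (row.get("Title") or "")[:80])
```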
To improve the database in further updates: we provide an online form to submit any irrelevant references you may find, or any relevant references missing from the last version. The form is available at the following link: https://forms.office.com/e/6vu9X0kBcw
It was produced as part of Health CASCADE, a Marie Skłodowska-Curie Innovative Training Network funded by the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement n° 956501.
The work is made available under the terms of the CC BY-NC 4.0 license (Creative Commons Attribution-NonCommercial 4.0 International).
https://crawlfeeds.com/privacy_policy
This furniture e-commerce dataset includes 140,000+ structured product records collected from online retail sources. Each entry provides detailed product information, categories, and breadcrumb hierarchies, making it ideal for AI, machine learning, and analytics applications.
Key Features:
📊 140K+ furniture product records in structured format
🏷 Includes categories, subcategories, and breadcrumbs for taxonomy mapping
📂 Delivered as a clean CSV file for easy integration
🔎 Perfect dataset for AI, NLP, and machine learning model training
Best Use Cases:
✔ LLM training & fine-tuning with domain-specific data
✔ Product classification datasets for AI models
✔ Recommendation engines & personalization in e-commerce
✔ Market research & furniture retail analytics
✔ Search optimization & taxonomy enrichment
Why this dataset?
Large volume (140K+ furniture records) for robust training
Real-world e-commerce product data
Ready-to-use CSV, saving preprocessing time
Affordable licensing with bulk discounts for enterprise buyers
Note:
Each record in this dataset includes both a url (main product page) and a buy_url (the actual purchase page). The dataset is structured so that records are keyed on the buy_url, ensuring you get unique, actionable product-level data instead of just generic landing pages.
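A minimal sketch of how that buy_url-keyed structure can be checked and used, assuming a local CSV export; the file name and the "breadcrumbs" column format ("A > B > C") are guesses, while url and buy_url are the columns described above.

```python
import pandas as pd

# Minimal sketch, assuming a local CSV export. The file name and the
# breadcrumb separator are guesses; "url" and "buy_url" are described above.
df = pd.read_csv("furniture_products.csv")  # hypothetical file name

# Records are keyed on buy_url, so it should be unique per row.
assert df["buy_url"].is_unique, "expected one record per purchase page"

# Example: count products per top-level breadcrumb category.
top_level = df["breadcrumbs"].str.split(">").str[0].str.strip()
print(top_level.value_counts().head(10))
```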
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Every two years the WECC (Western Electricity Coordinating Council) releases an Anchor Data Set (ADS) to be analyzed with Production Cost Models (PCMs), representing the expected loads, resources, and transmission topology 10 years in the future from a given reference year. For hydropower resources, the WECC relies on members to provide data to parameterize the hydropower representation in production cost models. The datasets consist of plant-level hydropower generation, flexibility, ramping, and mode of operations, and are tied to the hydropower representation in those production cost models.
In 2022, PNNL supported the WECC by developing the WECC ADS 2032 hydropower dataset [1]. The WECC ADS 2032 hydropower dataset (generation and flexibility) included an update of the climate year conditions (2018 calendar year), consistency in representation across the entire US WECC footprint, updated hydropower operations over the core Columbia River, and a higher temporal resolution (weekly instead of monthly) associated with a GridView software update (weekly hydro logic) [1]. Proprietary WECC utility hydropower data were used when available to develop the monthly and weekly datasets, and were complemented with HydroWIRES B1 methods used to develop the Hydro 923 plus (now RectifHydPlus) weekly hydropower dataset [2] and the flexibility parameterization [3]. The team worked with the Bonneville Power Administration to develop hydropower datasets over the core Columbia River representative of the post-2018 change in environmental regulation (flex spill). Ramping data are considered proprietary, were leveraged from WECC ADS 2030, and were not provided in the release; neither are the WECC-member hydropower data.
This release represents the WECC ADS 2034 hydropower dataset. The generator database was first updated by WECC. Based on a review of hourly generation profiles, 16 facilities were transitioned from fixed schedule to dispatchable (380.5 MW). The operations of the core Columbia River were updated based on Bonneville Power Administration's long-term hydro-modeling using 2020-level modified flows and fiscal year 2031 expected operations. The update was necessary to reflect the new environmental regulation (EIS2023). The team also included a newly developed extension over Canada [4] that improves upon existing data and synchronizes the US and Canadian data to the same 2018 weather year. Canadian facilities over the Peace River were not updated due to a lack of available flow data. The team was able to modernize and improve the overall data processing using modern tools, as well as provide thorough documentation and reproducible workflows [5,6]. The datasets have been incorporated into the 2034 ADS and are in active use by WECC and the community.
The WECC ADS 2034 hydropower datasets contain generation at weekly and monthly timesteps for US hydropower plants, monthly generation for Canadian hydropower plants, and the two merged together. Separate datasets are included for generation by hydropower plant and generation by individual generator units. Only processed data are provided. Original WECC-utility hourly data are under a non-disclosure agreement and for the sole use of developing this dataset.
[1] Voisin, N., Harris, K. M., Oikonomou, K., Turner, S., Johnson, A., Wallace, S., Racht, P., et al. (2022). WECC ADS 2032 Hydropower Dataset (PNNL-SA-172734). See presentation (Voisin N., K.M. Harris, K. Oikonomou, and S. Turner. 04/05/2022. "WECC 2032 Anchor Dataset - Hydropower." Presented by N. Voisin, K. Oikonomou at WECC Production Cost Model Dataset Subcommittee Meeting, Online, Utah. PNNL-SA-171897.).
[2] Turner, S. W. D., Voisin, N., Oikonomou, K., & Bracken, C. (2023). Hydro 923: Monthly and Weekly Hydropower Constraints Based on Disaggregated EIA-923 Data (v1.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8212727
[3] Stark, G., Barrows, C., Dalvi, S., Guo, N., Michelettey, P., Trina, E., Watson, A., Voisin, N., Turner, S., Oikonomou, K., & Colotelo, A. (2023). Improving the Representation of Hydropower in Production Cost Models, NREL/TP-5700-86377, United States. https://www.osti.gov/biblio/1993943
[4] Son, Y., Bracken, C., Broman, D., & Voisin, N. (2025). Monthly Hydropower Generation Dataset for Western Canada (1.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14984725
[5] https://github.com/HydroWIRES-PNNL/weccadshydro/
| File | Description | Timestep | Spatial Extent |
|------|-------------|----------|----------------|
| US_Monthly_Plant.csv | Generation data for US plants at a monthly timestep | Monthly | US |
| US_Weekly_Plant.csv | Generation data for US plants at a weekly timestep | Weekly | US |
| US_Monthly_Unit.csv | Generation data for US plants by generator units at a monthly timestep | Monthly | US |
| US_Weekly_Unit.csv | Generation data for US plants by generator units at a weekly timestep | Weekly | US |
| Canada_Monthly_Plant.csv | Generation data for Canadian plants at a monthly timestep | Monthly | Canada |
| Canada_Monthly_Unit.csv | Generation data for Canadian plants by generator units at a monthly timestep | Monthly | Canada |
| Merged_Monthly_Plant.csv | Generation data for US and Canadian plants at a monthly timestep | Monthly | US and Canada |
| Merged_Monthly_Unit.csv | Generation data for US and Canadian plants by generator units at a monthly timestep | Monthly | US and Canada |
| | Overview presentation of the WECC ADS 2034 dataset | N/A | N/A |
| PNNL-SA-171897.pdf | Overview presentation of the WECC ADS 2032 dataset | N/A | N/A |
Each dataset contains the following column headers:
| Column Name | Unit | Description |
|-------------|------|-------------|
| Source | N/A | Indicates the method used to develop the data (see below) |
| Generator Name | N/A | Generator name used in WECC PCM (in unit datasets) |
| EIA ID | N/A | Energy Information Administration (EIA) plant ID (in plant datasets) |
| DataTypeName | N/A | Data type (see below) |
| DatatypeID | N/A | Data type ID |
| Year | year | Year (not used) |
| Week1 [Month1] | MWh | Generation value for the data type; subsequent week or month columns contain data for each week or month in the dataset period |
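Since each row carries the metadata columns followed by Week1..WeekN (or Month1..Month12) values, a common first step is reshaping to long format. A minimal pandas sketch, assuming the column names listed above:

```python
import pandas as pd

# Minimal sketch, assuming the wide layout described above: metadata columns
# followed by Week1..WeekN (or Month1..Month12) generation columns in MWh.
df = pd.read_csv("US_Weekly_Plant.csv")

meta = ["Source", "Generator Name", "EIA ID", "DataTypeName", "DatatypeID", "Year"]
id_cols = [c for c in meta if c in df.columns]          # plant vs unit files differ
week_cols = [c for c in df.columns if c.startswith("Week")]

long = df.melt(id_vars=id_cols, value_vars=week_cols,
               var_name="Week", value_name="Generation_MWh")
long["Week"] = long["Week"].str.replace("Week", "", regex=False).astype(int)
print(long.head())
```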
The dataset contains data from four different data sources, developed using different methods:
| Source | Description |
|--------|-------------|
| PNNL | Weekly/monthly aggregation performed by PNNL using hourly observed facility-scale generation provided in 2022 by asset owners for year 2018 |
| BPA | BPA long-term hydro-modeling (HYDSIM) with 2020-level modified flows for water years 1989-2018, using FY 2031 expected operations (EIS2023). Jan-Sep comes from year 2018 and Oct-Dec from year 2007 |
| CAISO | Weekly/monthly aggregation performed by CAISO using hourly observed facility-scale generation for 2018. Daily flexibility also directly provided by CAISO |
| Canada | |
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The property register is kept in electronic form in the Official Property Register Information System (ALKIS). The present Web Feature Service enables the targeted download of geo-objects in ALKIS based on a search query (direct-access download service). The service only provides the following geo-objects, limited to the essential properties, in the format of a simplified data exchange scheme defined in the "AdV product specification ALKIS-WFS and output formats (Shape, CSV)" (see www.adv-online.de): plots of land [including owners]. The service is designed for use in simple, practical GIS clients without complex functionalities. The output format is CSV. If multiple values of an attribute (e.g. owners) are present for the feature type Landstueck, as many records are output for a parcel as are needed to make all attributes unique. The service includes personal data and is secured (in the LVN NRW); registration is required for access. Status of the data used: 01.04.2022.
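Because access requires registration, the following is only a minimal sketch of a WFS 2.0 GetFeature download; the endpoint URL, credentials, and feature type name are placeholders, not the real service values.

```python
import requests

# Minimal sketch of a WFS 2.0 GetFeature download. The endpoint URL,
# credentials, and feature type name are placeholders, not the real
# service values (the actual service requires registration in the LVN NRW).
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "Landstueck",   # placeholder feature type name
    "outputFormat": "csv",       # the service delivers CSV
    "count": 100,                # limit for a quick test
}
resp = requests.get(
    "https://example.nrw.de/alkis-wfs",  # placeholder endpoint
    params=params,
    auth=("USER", "PASSWORD"),           # placeholder credentials
    timeout=60,
)
resp.raise_for_status()
print(resp.text[:500])
```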
The relevant features of the LIWC psychological dictionary are extracted from the consultation text after preprocessing the depression consultation data collected from an online consultation platform.
File name: DepressionLevelPrediction-LIWC-Processed.csv
Creation time: 2022-12-20
Function: explore the relationship between LIWC-based features and depression
Data volume: 3859
Data format: utf8
Field description:
- ID: consultation record code
- Depression: degree of depression (3: severe; 2: moderate; 1: mild; 0: undiagnosed)
- Age: age
- Gender: gender (1: male; 0: female)
- Region: region (temporarily unused)
- Identity: identity (temporarily unused)
- Socialize: sociality
- Emotion: emotion
- Cognition: cognition
- Perception: perception
- Physiology: physiology
- Gains or losses
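A minimal sketch of loading the file and comparing LIWC feature means across depression levels, using the field names from the description above:

```python
import pandas as pd

# Minimal sketch: compare LIWC feature means across depression levels,
# using the column names from the field description above.
df = pd.read_csv("DepressionLevelPrediction-LIWC-Processed.csv", encoding="utf-8")

liwc_cols = ["Socialize", "Emotion", "Cognition", "Perception", "Physiology"]
print(df.groupby("Depression")[liwc_cols].mean())
```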
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ly, L.H., Protopopova, A. (2022). A mixed-method analysis of the consistency of intake information reported by shelter staff upon owner surrender of dogs.
This dataset is the raw file from an online experiment assessing the agreement in data input for surrender reason, breed, and colour across shelter staff when presented with four complex narratives of fictional owners surrendering dogs.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data in the Classroom is an online curriculum to foster data literacy. This Ocean Acidification module is geared towards grades 8-12. Visit Data in the Classroom for more information. This application is the Ocean Acidification module. This module was developed to engage students in increasingly sophisticated modes of understanding and manipulation of data. It was completed prior to the release of the Next Generation Science Standards (NGSS)* and has recently been adapted to incorporate some of the innovations described in the NGSS. Each level of the module provides learning experiences that engage students in the three dimensions of the NGSS Framework while building towards competency in targeted performance expectations. Note: this document identifies the specific practice, core idea and concept directly associated with a performance expectation (shown in parentheses in the tables) but also includes additional practices and concepts that can help students build toward a standard.
*NGSS Lead States. 2013. Next Generation Science Standards: For States, By States. Washington, DC: The National Academies Press. Next Generation Science Standards is a registered trademark of Achieve. Neither Achieve nor the lead states and partners that developed the Next Generation Science Standards was involved in the production of, and does not endorse, this product.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. Its purpose is to store the datasets used in some of the studies that served as research material for the thesis, as well as the datasets used in its experimental part.
The datasets are specified below, along with details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2,026 tweets. The file consists of 3 columns: id, polarity, and tweet, denoting the unique id, the polarity index of the text, and the tweet text, respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as per their details listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating-inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage (see the sketch after the file name below).
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
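Because the rows are ordered by class, a minimal shuffling sketch (pandas, fixed seed for reproducibility) before any train/test split:

```python
import pandas as pd

# Minimal sketch: shuffle the class-ordered rows (first half negative,
# second half positive) with a fixed seed before any train/test split.
df = pd.read_csv("data_rt.csv")  # columns: reviews, labels
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```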
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score), and division (categorical label generated using the polarity score), as sketched below.
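As a sketch of the TextBlob-based labelling described above (the polarity thresholds here are illustrative assumptions, not necessarily the dataset's exact rule):

```python
import pandas as pd
from textblob import TextBlob

# Minimal sketch of TextBlob-based labelling; the thresholds below are
# illustrative assumptions, not necessarily the dataset's exact rule.
def label(text):
    polarity = TextBlob(str(text)).sentiment.polarity  # in [-1, 1]
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

df = pd.read_csv("EcoPreprocessed.csv")
df["division_check"] = df["review"].apply(label)
print(df[["review", "polarity", "division_check"]].head())
```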
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh_arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product, and division (manually added: categorical label generated using the ReviewStar score).
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw), and division (manually added: categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files; depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.
The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create TSV files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these TSVs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these TSV files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript.
A stage will only run if the outputs from the previous stages do not exist; if the intermediate files exist, they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets.
Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a Unix system: tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a Unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.
Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a Unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.
Running the analysis: Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a Unix system: tar xf code.tar && 7z x intermediate_data.7z). Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a Unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.
Generating datasets - building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a Unix system: tar xf code.tar && 7z x userroles_data.7z). Install the R dependencies as above, then run 01_build_datasets.R.
Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Public data set used within the plan4res project for performing case study 1, "Multi-modal European energy concept for achieving COP21" - Multi-modal Investment modelling (MIM). Part 1: time series for the reference year 2015.
The related documentation is included in plan4res' deliverable D4.5 chapter 3.2 (see 10.5281/zenodo.3785010)
The data set includes the following data:
a) characteristic annual load profiles for large industrial heat demand for chemical, iron & steel, food & beverage and pulp & paper industries for the reference year 2015
HOTMAPS_TD_OUT_D_CHEM_20200608T160653_20200422T120000Z_v01.csv
HOTMAPS_TD_OUT_D_FOOD_20200608T160724_20200422T120000Z_v01.csv
HOTMAPS_TD_OUT_D_IRON_20200608T160705_20200422T120000Z_v01.csv
HOTMAPS_TD_OUT_D_PAPER_20200608T160715_20200422T120000Z_v01.csv
b) characteristic demand profiles for road-side car passenger transport and availability of cars for charging while (home) parking for the reference year 2015
SIEMENS_TD_OUT_D_RoadCar_20200608T160627_20200401T120000Z_v01.csv
SIEMENS_TD_CAP_CarPark_20200608T160637_20200401T120000Z_v01.csv
c) load profiles for exogenous electricity demand for the reference year 2015. The exogenous demand includes all electricity consumption not explicitly modeled within the MIM modeling.
HRE4_TD_OUT_ElectricityExo_20200608T160732_20200401T120000Z_v01.csv
d) regionally resolved demand profiles for (individual) space heating and space cooling for the reference year 2015
HRE4_TRD_CAP_Cool_2015_20200608T160051_20200401T120000Z_v01.csv
HRE4_TRD_CAP_HeatInd_2015_20200608T155849_20200401T120000Z_v01.csv
e) regionally resolved generation profiles of electricity from photovoltaic, wind onshore, wind offshore, and hydro run-of-river, and for heat generation from solar thermal, for the reference year 2015
NINJA_TRD_CAP_PV_2015_20200608T160440_20191104T120000Z_v01.csv
NINJA_TRD_CAP_WindOFF_2015_20200608T155422_20191104T120000Z_v01.csv
NINJA_TRD_CAP_WindON_2015_20200608T155251_20191104T120000Z_v01.csv
HRE4_TRD_CAP_HydroRoR_2015_20200608T155550_20200401T120000Z_v01.csv
HRE4_TRD_CAP_SolarThermal_2015_20200608T155718_20200401T120000Z_v01.csv
f) regionally resolved generation profile of electricity from wind offshore, transformed to represent potential future capacity factors as stated by doi:10.2760/041705. Data based on the reference year 2015
SIEMENS_TRD_CAP_WindOFF_2040_20200608T155127_20200401T120000Z_v01.csv
x) a list of geographical descriptions of the zone hierarchy data used in MIM for the EU33 region set:
SIEMENS_ZoneHierarchy_MIM_EU33_20181231T120000Z_20200131T1200000Z_v001.csv
Further info:
Time series are based on historical data for the reference year 2015.
Values are normalized over one reference year so that either the maximum = 1 (CAP) or the integral = 1 (OUT), as illustrated in the sketch after this list.
All values are listed in arbitrary units.
All country names are according to ISO 3166-1 alpha-2.
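A minimal sketch of the two normalization conventions, using an illustrative random hourly series:

```python
import numpy as np

# Minimal sketch of the two normalization conventions described above:
# CAP profiles are scaled so their maximum is 1, OUT profiles so that
# their integral (sum over the reference year) is 1.
rng = np.random.default_rng(0)
series = rng.random(8760)  # illustrative hourly series for one year

cap_profile = series / series.max()  # CAP: maximum = 1
out_profile = series / series.sum()  # OUT: integral = 1

assert np.isclose(cap_profile.max(), 1.0)
assert np.isclose(out_profile.sum(), 1.0)
```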
https://spdx.org/licenses/CC0-1.0.html
The COVID-19 pandemic profoundly affected various aspects of daily life, particularly the supply and demand of essential goods, resulting in critical shortages. This included personal protective equipment (PPE) for medical professionals and the general public. To address these shortages, online "maker communities" emerged, aiming to develop and locally manufacture critical products. While some organized efforts existed, the majority of initiatives originated from individuals and groups on platforms like Thingiverse. This paper presents a longitudinal analysis of Thingiverse, one of the largest maker community websites, to examine the pandemic's effects. Our findings reveal a surge in community output during the initial lockdown periods in major contributing nations (primarily those in the Western Hemisphere), followed by a subsequent decline. Additionally, pandemic-related products dominated uploads and interactions throughout 2020. Based on these observations, we propose recommendations to expedite the community's ability to support local, national, and international responses to future disasters.
Methods: collected using the Thingiverse API; some data have been processed into CSV to make them easier to handle.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Heaven Dataset (Refined)
Dataset Overview
The Heaven Dataset (Refined) is a collection of messages with classifications related to predatory behavior detection in online conversations. This dataset is designed to help train and evaluate AI models that can identify potentially harmful communication patterns directed at minors.
Dataset Description
General Information
Dataset Name: Heaven Dataset (Refined)
Version: 1.0
File Format: CSV
Creation Date: …
See the full description on the dataset page: https://huggingface.co/datasets/safecircleai/heaven_dataset_v2.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Input files for Dispa-SET for the JRC report "Power System Flexibility in a variable climate"
Here you can find the input files needed to reproduce the results of the report:
De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.
The results in the report are generated with the Dispa-SET power system model, available and explained at www.dispaset.eu.
A description of the data sources, with references, can be found in the report.
How to use this dataset
This dataset can be used as input data for the Dispa-SET model. We refer to the report and the official model documentation for information about the data and the model.
Description of the dataset
The file EnVarClim.yml is a template of the YAML configuration file used by Dispa-SET. To run a specific climate year, the XXXX present in some input files must be replaced with the year.
Availability factors
The folder AvailabilityFactors contains the availability factors (from 0 to 1) for the power plants and the renewable generation. There is a subfolder for each simulated zone, each containing a file per climate year: from emh_and_cc_availability_1990.csv to emh_and_cc_availability_2015.csv.
Cross-border transmission
The folder DayAheadNTC contains the file merged_constant_NTC.csv with the capacity (in MW).
NOTE: due to an error in the pre-processing code, there are some additional lines for the Western Balkans countries ending with a 1 (e.g. GR -> MK1). Those lines are ignored by the model because they are not associated with any simulated zone.
Cross-border historical flows
The file CC_L_flows.csv in the folder Flows contains the hourly flows between the simulated zones and their neighbours (RU, TR, UA).
Fuel prices
The folder FuelPrices contains a set of files with the hourly prices for the fuels (biomass, coal, lignite, gas, oil) and CO2 emissions. It is worth noting that, in spite of their hourly resolution, the time series are constant through the year.
Hourly load
The folder Load_RealTime contains hourly load time series for each zone, one per climate year. For the Western Balkans countries we use the same time series for each climate year.
Outage factors
The files CC_L_outages.csv in the folder OutageFactors contain the outage factor (from 1, full outage, to 0) for the various generation units. Whenever a simulation zone is missing, the model assumes the absence of outages.
Power plants data
The folder PowerPlants contains a file named CC_L_plants.mip.csv for each simulated zone. The CSV files contain the data needed by Dispa-SET.
Water storage levels
The folder ReservoirLevel contains the storage levels (values from 0 to 1, relative to the size of the storage) for all the simulated zones. The levels have been computed for each climate year using a different inflow, with the mid-term scheduler recently implemented in Dispa-SET. For the Western Balkans countries we use the same time series for each climate year.
Hydro-power inflows
The folder ScaledInflows contains the inflows used for the hydro-power generation. The values in the CSV files describe how much energy is available for hydro-power generation compared to the installed capacity.
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in a form of Jupyter notebooks.
Options to access the dataset
There are two ways how to get access to the dataset:
1. Static dump of the dataset available in the CSV format
2. Continuously updated dataset available via REST API
In order to obtain an access to the dataset (either to full static dump or REST API), please, request the access by following instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please, cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (from RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawler and parsers were implemented (e.g., for fact checking site Snopes.com). All data is stored in the unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
Note: personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe the individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
- entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations. Possible values: sources, articles, fact-checking-articles.
- entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation
https://doi.org/10.23668/psycharchives.4988
Younger men and especially younger women are excluded from leadership roles or obstructed from succeeding in these positions by facing backlash. Our project aims to build a more gender-specific understanding of the backlash that younger individuals in leadership positions face. We predict an interactive backlash for younger women and younger men that is rooted in intersectional stereotypes, compared to the stereotypes based on single demographic categories (i.e., age or gender stereotypes). To test our hypotheses, we collect data from a heterogeneous sample (N = 900) of U.S. citizens between 25 and 69 years. We conduct an experimental online study with a between-participant design to examine the backlash against younger women and younger men.
Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2022). An Intersectional Lens on Leadership: Prescriptive Stereotypes towards Younger Women and Younger Men and their Effect on Leadership Perception. PsychArchives. https://doi.org/10.23668/psycharchives.5404
Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2023). An intersectional lens on young leaders: bias toward young women and young men in leadership positions. Frontiers in Psychology (Vol. 14). Frontiers Media SA. https://doi.org/10.3389/fpsyg.2023.120454
Research has recognized age biases against young leaders, yet understanding of how gender, the most frequently studied demographic leader characteristic, influences this bias remains limited. In this study, we examine the gender-specific age bias toward young female and young male leaders through an intersectional lens. By integrating intersectionality theory with insights on status beliefs associated with age and gender, we test whether young female and male leaders face an interactive rather than an additive form of bias. We conducted two preregistered experimental studies (N1 = 918 and N2 = 985), where participants evaluated leaders based on age, gender, or a combination of both. Our analysis reveals a negative age bias in leader status ascriptions toward young leaders compared to middle-aged and older leaders. This bias persists when gender information is added, as demonstrated in both intersectional categories of young female and young male leaders. This bias pattern does not extend to middle-aged or older female and male leaders, thereby supporting the age bias against young leaders specifically. Interestingly, we also examined whether social dominance orientation strengthens the bias against young (male) leaders, but our results (reported in the SOM) are not as hypothesized. In sum, our results emphasize the importance of young age as a crucial demographic characteristic in leadership perceptions that can even overshadow the role of gender.
Raw Data File