Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An open access graph dataset showing the connections between Dryad, CERN, ANDS, and other international data repositories and publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim of discovering and connecting related research datasets based on publication co-authorship or jointly funded grants.
The Everglades Vulnerability Analysis (EVA) is a series of connected Bayesian networks that models the landscape-scale response of indicators of Everglades ecosystem health to changes in hydrology and salinity on the landscape. Using the uncertainty built into each network, it also produces surfaces of vulnerability in relation to user-defined ‘ideal’ outcomes. This dataset includes the code used to build the modules and generate outputs of module outcome probabilities and landscape vulnerability.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool; Resources 2-8 are generated using the Flatterer utility.
Description of resources:
1. Dataset is a JSON Lines file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON (a short reading sketch follows this list).
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. DataStore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as a count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format of each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to the Catalogue. It can be queried with database or analytical software tools.
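To give a sense of how Resource 1 can be consumed, here is a minimal Python sketch that reads the gzipped JSON Lines export line by line; the local filename is hypothetical, and the only field accessed ('title') is standard CKAN package metadata.

```python
import gzip
import json

# Hypothetical local filename for the gzipped JSON Lines catalogue export (Resource 1);
# adjust to the file you actually downloaded from the portal.
path = "od-do-canada.jsonl.gz"

titles = []
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)           # one Dataset/Open Information Record per line
        titles.append(record.get("title"))  # CKAN package metadata; heavily nested overall

print(f"{len(titles)} records read; first title: {titles[0]!r}")
```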
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of the mobile device, such as its supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various purposes, e.g., analyzing MAC address randomization, determining the number of people at a given location in a given time period, or analyzing trends in population movement (streets, shopping malls, etc.) across time periods.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which uses the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- supported data rates
- extended supported rates
- HT capabilities
- extended capabilities
- data under the extended tag and vendor specific tag
- interworking
- VHT capabilities
- RSSI
- SSID
- timestamp when the PR was received
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
PR_IE_data = {
    'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
    'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
        ...
    }
}
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_data.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure makes it possible to store only the time of arrival ('TIME') and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs seen from the same MAC address are also stored. The data of a newly detected PR is compared with the data already stored for the same MAC address in the current scan time interval. If identical IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical IE data from that MAC address has not yet been received, the PR_data structure of the new PR is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
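As a rough illustration of the structure described above, the following Python sketch walks one capture and counts probe requests per MAC address; the top-level layout of Single_PR_capture_example.json (scan metadata plus a list of per-MAC entries, here assumed under a 'DEVICES' key) is an assumption, so adjust the access paths to the actual file.

```python
import json

# Walk one scan-interval capture and count probe requests per MAC address.
# The per-MAC structure follows the description above; the top-level layout of the
# example file (scan metadata plus a list of per-MAC entries) is an assumption.
with open("Single_PR_capture_example.json", encoding="utf-8") as fh:
    capture = json.load(fh)

entries = capture if isinstance(capture, list) else capture.get("DEVICES", [capture])

for entry in entries:
    mac = entry["MAC"]
    ssids = entry.get("SSIDs", [])
    # Each element of PROBE_REQs groups PRs with identical IE data; 'TIME' and 'RSSI'
    # are parallel lists, so their length gives the number of PRs in that group.
    n_prs = sum(len(pr["TIME"]) for pr in entry["PROBE_REQs"])
    print(f"{mac}: {n_prs} probe requests, SSIDs seen: {ssids}")
```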
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, containing the samples recorded by that device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23rd of September 2022 00:00 local time until 24th of September 2022 00:00 local time.
Files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration, and the installation and data status of the entire collection system were checked in person. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day, which appears as up to a few minutes of missing data.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to an Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location is constant and undisturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment, and the internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France were interviewed between Jan-2016 and Aug-2016 in order to obtain exploitable data; this sample might therefore not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
Arbitrarily, we chose characteristics to describe Data Scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. As with all research, it might not be free of internal contradictions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical open data on LAU regions of Slovakia, the Czech Republic, Poland, and Hungary (and other countries in the future). LAU1 regions are called counties, okres, okresy, powiat, járás, járási, NUTS4, LAU, Local Administrative Units, ... and there are 733 of them in this V4 dataset. Overall, we cover 733 regions described by 137.828 observations (panel data rows) and more than 1.760.229 data points.
This LAU dataset contains panel data on population, on the age structure of inhabitants, and on the number and structure of registered unemployed. The dataset was prepared by Michal Páleník. Output files are in json, shapefile, xls, ods, topojson or CSV formats. Downloadable at zenodo.org.
This dataset consists of:
data on unemployment (by gender, education and duration of unemployment),
data on vacancies,
open data on population in Visegrad counties (by age and gender),
data on unemployment share.
Combined latest dataset
dataset of the latest available data on unemployment, vacancies and population
dataset includes map contours (shp, topojson or geojson format), relation id in OpenStreetMap, wikidata entry code,
it also includes NUTS4 code, LAU1 code used by national statistical office and abbreviation of the region (usually license plate),
source of map contours is OpenStreetMap, licensed under ODbL
no time series, only most recent data on population and unemployment combined in one output file
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies, pop_period, TOTAL, Y15-64, Y15-64-females, local_lau, osm_id, abbr, wikidata, population_density, area_square_km, way
Slovakia – SK: 79 LAU1 regions, data for 2024-10-01, 1.659 data,
Czech Republic – CZ: 77 LAU1 regions, data for 2024-10-01, 1.617 data,
Poland – PL: 380 LAU1 regions, data for 2024-09-01, 6.840 data,
Hungary – HU: 197 LAU1 regions, data for 2024-10-01, 2.955 data,
13.071 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 79 77 380 197
lau LAU code of the region 79 77 380 197
name name of the region in local language 79 77 380 197
registered_unemployed number of unemployed registered at labour offices 79 77 380 197
registered_unemployed_females number of unemployed women 79 77 380 197
disponible_unemployed unemployed able to accept job offer 79 77 0 0
low_educated unemployed without secondary school (ISCED 0 and 1) 79 77 380 197
long_term unemployed for longer than 1 year 79 77 380 0
unemployment_inflow inflow into unemployment 79 77 0 0
unemployment_outflow outflow from unemployment 79 77 0 0
below_25 number of unemployed below 25 years of age 79 77 380 197
over_55 unemployed older than 55 years 79 77 380 197
vacancies number of vacancies reported by labour offices 79 77 380 0
pop_period date of population data 79 77 380 197
TOTAL total population 79 77 380 197
Y15-64 number of people between 15 and 64 years of age, population in economically active age 79 77 380 197
Y15-64-females number of women between 15 and 64 years of age 79 77 380 197
local_lau region's code used by local labour offices 79 77 380 197
osm_id relation id in OpenStreetMap database 79 77 380 197
abbr abbreviation used for this region 79 77 380 0
wikidata wikidata identification code 79 77 380 197
population_density population density 79 77 380 197
area_square_km area of the region in square kilometres 79 77 380 197
way geometry, polygon of given region 79 77 380 197
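As an illustration of how the combined latest table can be used, the sketch below loads a CSV export and derives a simple unemployment share per region from the registered_unemployed and Y15-64 columns; the filename is hypothetical and the ratio is purely illustrative, not necessarily the definition used in the unemployment share data above.

```python
import pandas as pd

# Hypothetical filename for a CSV export of the combined latest table; column names
# follow the list above (registered_unemployed, Y15-64, lau, name, ...).
df = pd.read_csv("lau1_combined_latest.csv")

# A simple illustrative unemployment share: registered unemployed relative to the
# population aged 15-64, in percent.
df["unemployment_share"] = df["registered_unemployed"] / df["Y15-64"] * 100

top = df.sort_values("unemployment_share", ascending=False)
print(top[["lau", "name", "unemployment_share"]].head(10))
```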
Unemployment dataset
time series of unemployment data in Visegrad regions
by gender, duration of unemployment, education level, age groups, vacancies,
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies
Slovakia – SK: 79 LAU1 regions, data for 334 periods (1997-01-01 ... 2024-10-01), 202.082 data,
Czech Republic – CZ: 77 LAU1 regions, data for 244 periods (2004-07-01 ... 2024-10-01), 147.528 data,
Poland – PL: 380 LAU1 regions, data for 189 periods (2005-03-01 ... 2024-09-01), 314.100 data,
Hungary – HU: 197 LAU1 regions, data for 106 periods (2016-01-01 ... 2024-10-01), 104.408 data,
768.118 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 26 386 18 788 71 772 20 882
lau LAU code of the region 26 386 18 788 71 772 20 882
name name of the region in local language 26 386 18 788 71 772 20 882
registered_unemployed number of unemployed registered at labour offices 26 386 18 788 71 772 20 882
registered_unemployed_females number of unemployed women 26 386 18 788 62 676 20 882
disponible_unemployed unemployed able to accept job offer 25 438 18 788 0 0
low_educated unemployed without secondary school (ISCED 0 and 1) 11 771 9855 41 388 20 881
long_term unemployed for longer than 1 year 24 253 9855 41 388 0
unemployment_inflow inflow into unemployment 26 149 16 478 0 0
unemployment_outflow outflow from unemployment 26 149 16 478 0 0
below_25 number of unemployed below 25 years of age 11 929 9855 17 100 20 881
over_55 unemployed older than 55 years 11 929 9855 17 100 20 882
vacancies number of vacancies reported by labour offices 11 692 18 788 62 676 0
Population dataset
time series on population by gender and 5 year age groups in V4 counties
columns: period, lau, name, gender, TOTAL, Y00-04, Y05-09, Y10-14, Y15-19, Y20-24, Y25-29, Y30-34, Y35-39, Y40-44, Y45-49, Y50-54, Y55-59, Y60-64, Y65-69, Y70-74, Y75-79, Y80-84, Y85-89, Y90-94, Y_GE95, Y15-64
Slovakia – SK: 79 LAU1 regions, data for 28 periods (1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 152.628 data,
Czech Republic – CZ: 78 LAU1 regions, data for 24 periods (2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 125.862 data,
Poland – PL: 382 LAU1 regions, data for 29 periods (1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 626.941 data,
Hungary – HU: 197 LAU1 regions, data for 11 periods (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 86.680 data,
992.111 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 6636 5574 32 883 4334
lau LAU code of the region 6636 5574 32 883 4334
name name of the region in local language 6636 5574 32 883 4334
gender gender (male or female) 6636 5574 32 883 4334
TOTAL total population 6636 5574 32 503 4334
Y00-04 inhabitants between 00 to 04 years inclusive 6636 5574 32 503 4334
Y05-09 number of inhabitants between 05 to 09 years of age 6636 5574 32 503 4334
Y10-14 number of people between 10 to 14 years inclusive 6636 5574 32 503 4334
Y15-19 number of inhabitants between 15 to 19 years of age 6636 5574 32 503 4334
Y20-24 number of people between 20 to 24 years inclusive 6636 5574 32 503 4334
Y25-29 number of inhabitants between 25 to 29 years of age 6636 5574 32 503 4334
Y30-34 inhabitants between 30 to 34 years inclusive 6636 5574 32 503 4334
Y35-39 number of inhabitants between 35 to 39 years of age 6636 5574 32 503 4334
Y40-44 inhabitants between 40 to 44 years inclusive 6636 5574 32 503 4334
Y45-49 number of inhabitants between 45 to 49 years inclusive 6636 5574 32 503 4334
Y50-54 inhabitants between 50 to 54 years inclusive 6636 5574 32 503 4334
Y55-59 number of inhabitants between 55 to 59 years of age 6636 5574 32 503 4334
Y60-64 inhabitants between 60 to 64 years inclusive 6636 5574 32 503 4334
Y65-69 number of inhabitants between 65 to 69 years inclusive 6636 5574 32 503 4334
Y70-74 inhabitants between 70 to 74 years inclusive 6636 5574 24 670 4334
Y75-79 number of inhabitants between 75 to 79 years of age 6636 5574 24 670 4334
Y80-84 number of people between 80 to 84 years inclusive 6636 5574 24 670 4334
Y85-89 number of inhabitants between 85 to 89 years inclusive 6636 5574 0 0
Y90-94 inhabitants between 90 to 94 years inclusive 6636 5574 0 0
Y_GE95 number of people 95 years or older 6636 3234 0 0
Y15-64 number of people between 15 and 64 years of age, population in economically active age 6636 5574 32 503 4334
Notes
more examples at www.iz.sk
NUTS4 / LAU1 / LAU codes for HU and PL are created by me, so they can (and will) change in the future; CZ and SK NUTS4 codes are used by local statistical offices, so they should be more stable
NUTS4 codes are consistent with NUTS3 codes used by Eurostat
local_lau variable is an identifier used by local statistical office
abbr is abbreviation of region's name, used for map purposes (usually cars' license plate code; except for Hungary)
wikidata is code used by wikidata
osm_id is region's relation number in the OpenStreetMap database
Example outputs
you can download data in CSV, xml, ods, xlsx, shp, SQL, postgis, topojson, geojson or json format at 📥 doi:10.5281/zenodo.6165135
Counties of Slovakia – unemployment rate in Slovak LAU1 regions
Regions of the Slovak Republic
Unemployment of Czechia and Slovakia – unemployment share in LAU1 regions of Slovakia and Czechia
interactive map on unemployment in Slovakia
Slovakia – SK, Czech Republic – CZ, Hungary – HU, Poland – PL, NUTS3 regions of Slovakia
download at 📥 doi:10.5281/zenodo.6165135
suggested citation: Páleník, M. (2024). LAU1 dataset [Data set]. IZ Bratislava. https://doi.org/10.5281/zenodo.6165135
This entities dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'entities.json' file is to be used together with the 'documents.json' and 'relations.json' files (also found on this data.gov.uk webpage), and their structures and relationships are described on the given GitHub webpage.
From the Web site: The Post gained access to the Drug Enforcement Administration’s Automation of Reports and Consolidated Orders System, known as ARCOS, as the result of a court order. The Post and HD Media, which publishes the Charleston Gazette-Mail in West Virginia, waged a year-long legal battle for access to the database, which the government and the drug industry had sought to keep secret.
The version of the database published by The Post allows readers to learn how much hydrocodone and oxycodone went to individual states and counties, and which companies and distributors were responsible.
Also: Guidelines for using this data. Fill out the form below to establish a connection with our team and report any issues downloading the data. This will also allow us to update you with any additional information as it comes out and answer questions you may have. Because of the volume of requests, we ask that you use this channel rather than emailing our reporters individually. If you publish an online story, graphic, map or other piece of journalism based on this data set, please credit The Washington Post, link to the original source, and send us an email when you’ve hit publish. We want to learn what you discover and will attempt to link to your work as part of cataloguing the impact of this project. Post reporting and graphics can be used on-air. We ask for oral or on-screen credit to The Washington Post. For specific requests, including interviews with Post journalists, please email postpr@washpost.com.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Generation customers connected to UK Power Networks can be subject to curtailment through our Distributed Energy Resource Management System (DERMS) if they have accepted a curtailable connection. During periods of network congestion, these DERs will have their access reduced to mitigate network constraint breaches. Their reduction is organised according to their connection application date in a last-in first-out (LIFO) arrangement. The Constraints Real Time Meter Readings dataset on the Open Data Portal (ODP) gives a near real-time status of the constraints on our network that are used by DERMS to reduce access. This API-accessible dataset can be used to see just how congested the network is, and specific DER operators have access and visibility into the constraints affecting their own site. The dataset contains a timestamp, the constraint identifier, the most recent current reading in amps, the trim and release limits (curtailment starts at the trim and ends at the release), whether the site is in breach, a description of the constraint, and (only if you have access) the name of the DER. The dataset updates as close to real time as is possible. Our scheduling is as follows:
- At 15 s past the minute mark, we scrape the network data and push it to the ODP server.
- On the minute mark, the ODP runs an update to refresh the dataset.
- The dataset refresh is completed between 5-15 s past the minute mark.
- Only after this refresh has completed can you get the latest values from the ODP.
You can run this notebook to see the dataset in action: https://colab.research.google.com/drive/1Czx98U6zttlA3PC2OfI_0UzAbE48BvEq?usp=sharing
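For users who prefer a plain script over the notebook, the sketch below polls the portal just after the refresh window described above; the dataset id and the Opendatasoft Explore API path are assumptions, so check the dataset's API page on the portal for the exact values.

```python
import time
import requests

# Assumed dataset id and Opendatasoft Explore API path; check the dataset's API page
# on the portal for the actual values.
BASE = "https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/catalog/datasets"
DATASET_ID = "constraints-real-time-meter-readings"  # hypothetical id

def wait_until_second(sec: int = 20) -> None:
    """Sleep until `sec` seconds past the minute (the current one if still ahead, else the next)."""
    now = time.time()
    target = (now // 60) * 60 + sec
    if target <= now:
        target += 60
    time.sleep(target - now)

wait_until_second(20)  # the refresh completes 5-15 s past the minute, so 20 s is a safe margin
resp = requests.get(f"{BASE}/{DATASET_ID}/records", params={"limit": 10}, timeout=30)
resp.raise_for_status()
for record in resp.json().get("results", []):
    print(record)
```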
Methodological Approach
A Remote Terminal Unit (RTU) is installed at each curtailable-connection site providing live telemetry data into the DERMS. It measures communications status, generator output, and mode of operation. RTUs are also installed at constraint locations (physical parts of the network, e.g., transformers, cables which may become overloaded under certain conditions). These are identified through planning power load studies. These RTUs monitor current at the constraint and communications status. The DERMS design integrates network topology information. This maps constraints to associated curtailable connections under different network running conditions, including the sensitivity of the constraints to each curtailable connection. In general, a 1MW reduction in generation of a customer will cause <1MW reduction at the constraint. Each constraint is registered to a GSP. DERMS monitors constraints against the associated breach limit. When a constraint limit is breached, DERMS calculates the amount of access reduction required from curtailable connections linked to the constraint to alleviate the breach. This calculation factors in the real-time level of generation of each customer and the sensitivity of the constraint to each generator. Access reduction is issued to each curtailable-connection via the RTU until the constraint limit breach is mitigated. Multiple constraints can apply to a curtailable-connection and constraint breaches can occur simultaneously. Where multiple constraint breaches act upon a single curtailable-connection, we apportion the access reduction of that connection to the constraint breaches depending on the relative magnitude of the breaches. Where customer curtailment occurs without any associated constraint breach, we categorize the curtailment as non-constraint driven. Future developments will include the reason for non-constraint driven curtailment.
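The apportionment step described above can be illustrated with a toy calculation: a single curtailable connection affected by two simultaneous constraint breaches has its access reduction split in proportion to the breach magnitudes. The numbers are illustrative only and do not reflect the actual DERMS implementation.

```python
# Toy illustration of the apportionment described above: a single curtailable connection
# affected by two simultaneous constraint breaches has its access reduction split in
# proportion to the breach magnitudes. All numbers are illustrative only.
breaches = {"constraint_A": 12.0, "constraint_B": 4.0}  # breach magnitudes (amps over limit)
total_reduction_mw = 2.0                                # access reduction applied to this DER

total_breach = sum(breaches.values())
apportioned = {name: total_reduction_mw * magnitude / total_breach
               for name, magnitude in breaches.items()}
print(apportioned)  # {'constraint_A': 1.5, 'constraint_B': 0.5}
```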
Quality Control Statement
Quality Control Measures include:
Manual review and correction of data inconsistencies. Use of additional verification steps to ensure accuracy in the methodology.
Assurance Statement
The DSO Data Science Team has checked the data to ensure accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The input-output table is comprehensive and detailed in describing the national economic system with complex economic relationships, which embodies information of supply and demand among industrial sectors. This paper aims to scale the degree of competition/collaboration on the global value chain from the perspective of econophysics. Global Industrial Strongest Relevant Network models were established by extracting the strongest and most immediate industrial relevance in the global economic system with inter-country input-output tables and then transformed into Global Industrial Resource Competition Network/Global Industrial Production Collaboration Network models embodying the competitive/collaborative relationships based on bibliographic coupling/co-citation approach. Three indicators well suited for these two kinds of weighted and non-directed networks with self-loops were introduced, including unit weight for competitive/collaborative power, disparity in the weight for competitive/collaborative amplitude and weighted clustering coefficient for competitive/collaborative intensity. Finally, these models and indicators were further applied to empirically analyze the function of sectors in the latest World Input-Output Database, to reveal inter-sector competitive/collaborative status during the economic globalization.
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Datasets Description:
The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.
Classification and Regression Tasks: One can interpret these datasets as being suitable for both classification and regression analyses. The classes are ordered, albeit imbalanced. For instance, the dataset contains a more significant number of normal wines compared to excellent or poor ones.
Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include: 1. Fixed acidity 2. Volatile acidity 3. Citric acid 4. Residual sugar 5. Chlorides 6. Free sulfur dioxide 7. Total sulfur dioxide 8. Density 9. pH 10. Sulphates 11. Alcohol
The output variable, based on sensory data, is denoted by: 12. Quality (score ranging from 0 to 10)
Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
Operational Workflow: To utilize the dataset efficiently, the following steps are recommended:
1. Connect a File Reader node (for the CSV) to a Linear Correlation node and an Interactive Histogram node for basic Exploratory Data Analysis (EDA).
2. Connect the File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest'.
3. Connect the Rule Engine node output to the input of a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
4. Connect the Column Filter node output to the input of a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
5. Feed the Partitioning node's training split output into the input of a Decision Tree Learner node.
6. Connect the Partitioning node's test split output to the data input of a Decision Tree Predictor node.
7. Link the Decision Tree Learner node's model output to the model input of the Decision Tree Predictor node.
8. Finally, connect the Decision Tree Predictor output to the input of an ROC node for model evaluation based on the AUC value.
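The workflow above is written for KNIME; for readers working in Python, a rough scikit-learn equivalent of steps 2-8 is sketched below, assuming the semicolon-separated UCI winequality-red.csv file is available locally.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# The UCI "Vinho Verde" CSV files are semicolon-separated; the local filename is assumed.
df = pd.read_csv("winequality-red.csv", sep=";")

# Threshold the 0-10 quality score into a binary target (>= 7 -> good), then drop the
# original column to avoid leakage, mirroring steps 2-3 of the workflow above.
y = (df["quality"] >= 7).astype(int)
X = df.drop(columns="quality")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```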
Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Atticus Open Contract Dataset (AOK) (beta) is a corpus of 5,000+ labels in 200 commercial legal contracts that have been manually labeled by legal experts to identify 40 types of clauses that are important during contract review in connection with corporate transactions, such as mergers and acquisitions, IPOs, and corporate financing. The AOK Dataset is curated and maintained by The Atticus Project, Inc., a non-profit organization, to support NLP research and development in legal contract review. If you download this dataset, we'd love to know more about you and your project! Please fill out this short form: https://forms.gle/h47GUENTTbBqH39m7
Check out our website at atticusprojectai.org.
Update: The expanded 1.0 version of the dataset is available here https://zenodo.org/record/4595826
This packaged data collection contains two sets of two additional model runs that used the same inputs and parameters as our primary model, the exception being that we implemented a "maximum corridor length" constraint that allowed us to identify and visualize corridors as being well connected (≤15 km) or moderately connected (≤45 km). This is based on the assumption that corridors longer than 45 km are too long to sufficiently accommodate dispersal. One of these sets is based on a maximum corridor length that uses Euclidean (straight-line) distance, while the other set is based on a maximum corridor length that uses cost-weighted distance. These two sets of corridors can be compared against the full set of corridors from our primary model to identify the remaining corridors, which could be considered poorly connected. This package includes the following data layers:
- Corridors classified as well connected (≤15 km) based on cost-weighted distance
- Corridors classified as moderately connected (≤45 km) based on cost-weighted distance
- Corridors classified as well connected (≤15 km) based on Euclidean distance
- Corridors classified as moderately connected (≤45 km) based on Euclidean distance
Please refer to the embedded metadata and the information in our full report for details on the development of these data layers. Packaged data are available in two formats:
- Geodatabase (.gdb): a related set of file geodatabase rasters and feature classes, packaged in an ESRI file geodatabase.
- ArcGIS Pro Map Package (.mpkx): the same data included in the geodatabase, presented as fully symbolized layers in a map. Note that you must have ArcGIS Pro version 2.0 or greater to view.
See Cross-References for links to individual datasets, which can be downloaded in raster GeoTIFF (.tif) format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Article Information
The work involved in developing the dataset and benchmarking its use for machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.
Please do cite the aforementioned article when using this dataset.
Abstract
The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and provided it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.
ZIP Folder Content
The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.
To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.
This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
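A minimal sketch of loading one of the sub-datasets in either format is shown below; the folder and file names are hypothetical placeholders, so substitute the actual paths found inside the ZIP.

```python
import pandas as pd

# The folder layout and file names are placeholders; substitute the actual paths
# found inside the ZIP (BLE, IP-Based Packet, IP-Based Flows).
df_csv = pd.read_csv("Datasets/BLE/ble_dataset.csv")        # hypothetical filename
df_pkl = pd.read_pickle("Datasets/BLE/ble_dataset.pickle")  # same data, pickle format

print(df_csv.shape, df_pkl.shape)
print(df_csv.columns.tolist()[:10])
```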
Datasets' Content
Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.
Identified Key Features Within Bluetooth Dataset
Feature Meaning
btle.advertising_header BLE Advertising Packet Header
btle.advertising_header.ch_sel BLE Advertising Channel Selection Algorithm
btle.advertising_header.length BLE Advertising Length
btle.advertising_header.pdu_type BLE Advertising PDU Type
btle.advertising_header.randomized_rx BLE Advertising Rx Address
btle.advertising_header.randomized_tx BLE Advertising Tx Address
btle.advertising_header.rfu.1 Reserved For Future 1
btle.advertising_header.rfu.2 Reserved For Future 2
btle.advertising_header.rfu.3 Reserved For Future 3
btle.advertising_header.rfu.4 Reserved For Future 4
btle.control.instant Instant Value Within a BLE Control Packet
btle.crc.incorrect Incorrect CRC
btle.extended_advertising Advertiser Data Information
btle.extended_advertising.did Advertiser Data Identifier
btle.extended_advertising.sid Advertiser Set Identifier
btle.length BLE Length
frame.cap_len Frame Length Stored Into the Capture File
frame.interface_id Interface ID
frame.len Frame Length Wire
nordic_ble.board_id Board ID
nordic_ble.channel Channel Index
nordic_ble.crcok Indicates if CRC is Correct
nordic_ble.flags Flags
nordic_ble.packet_counter Packet Counter
nordic_ble.packet_time Packet time (start to end)
nordic_ble.phy PHY
nordic_ble.protover Protocol Version
Identified Key Features Within IP-Based Packets Dataset
Feature Meaning
http.content_length Length of content in an HTTP response
http.request HTTP request being made
http.response.code HTTP response status code
http.response_number Sequential number of an HTTP response
http.time Time taken for an HTTP transaction
tcp.analysis.initial_rtt Initial round-trip time for TCP connection
tcp.connection.fin TCP connection termination with a FIN flag
tcp.connection.syn TCP connection initiation with SYN flag
tcp.connection.synack TCP connection establishment with SYN-ACK flags
tcp.flags.cwr Congestion Window Reduced flag in TCP
tcp.flags.ecn Explicit Congestion Notification flag in TCP
tcp.flags.fin FIN flag in TCP
tcp.flags.ns Nonce Sum flag in TCP
tcp.flags.res Reserved flags in TCP
tcp.flags.syn SYN flag in TCP
tcp.flags.urg Urgent flag in TCP
tcp.urgent_pointer Pointer to urgent data in TCP
ip.frag_offset Fragment offset in IP packets
eth.dst.ig Ethernet destination is in the internal network group
eth.src.ig Ethernet source is in the internal network group
eth.src.lg Ethernet source is in the local network group
eth.src_not_group Ethernet source is not in any network group
arp.isannouncement Indicates if an ARP message is an announcement
Identified Key Features Within IP-Based Flows Dataset
Feature Meaning
proto Transport layer protocol of the connection
service Identification of an application protocol
orig_bytes Originator payload bytes
resp_bytes Responder payload bytes
history Connection state history
orig_pkts Originator sent packets
resp_pkts Responder sent packets
flow_duration Length of the flow in seconds
fwd_pkts_tot Forward packets total
bwd_pkts_tot Backward packets total
fwd_data_pkts_tot Forward data packets total
bwd_data_pkts_tot Backward data packets total
fwd_pkts_per_sec Forward packets per second
bwd_pkts_per_sec Backward packets per second
flow_pkts_per_sec Flow packets per second
fwd_header_size Forward header bytes
bwd_header_size Backward header bytes
fwd_pkts_payload Forward payload bytes
bwd_pkts_payload Backward payload bytes
flow_pkts_payload Flow payload bytes
fwd_iat Forward inter-arrival time
bwd_iat Backward inter-arrival time
flow_iat Flow inter-arrival time
active Flow active duration
These datasets are presented in the article "AYNEC: All You Need for Evaluating Completion Techniques in Knowledge Graphs", submitted to ESWC 2019. Please cite it in your work if you make use of them. The following datasets are included:
- WN18-AF, generated from WN18.
- WN18-AR, generated from WN18, removing inverses.
- WN11-AF, generated from WN11.
- WN11-AR, generated from WN11, removing inverses.
- FB13-A, generated from FB13.
- FB15K-AF, generated from FB15K.
- FB15K-AR, generated from FB15K, keeping relations that cover 95% of the graph and removing inverses.
- NELL-AF, generated from NELL.
- NELL-AR, generated from NELL, keeping relations that cover 95% of the graph and removing inverses.
In all datasets, we removed relations with only one instance, used 20% of each relation in the graph for testing, and generated one negative for each positive in both training and testing by replacing the target of the positive with a random entity. In WN11 and WN18 all entities are potential candidates. In the rest of the datasets, only entities that have appeared as targets of the relation are candidates. Two relations were considered inverses when there was a 90% overlap between them; that is, relations A and B are inverses if for 90% of the instances of A there is an instance of B with the source and target reversed, and vice versa. When removing inverses, the smaller of each pair of inverses was removed. Each zip file contains the following files about a dataset:
- train.txt - triples used for training. Each line contains the source, the relation, the target, and the label (1 for positives and -1 for negatives); a loading sketch follows this list.
- test.txt - triples used for testing, following the same format.
- relations.txt - a list of the relations in the dataset, each with its frequency.
- entities.txt - a list of the entities in the dataset, each with its total degree, inwards degree, and outwards degree.
- inverses.txt - a list of the inverses in the original graph, whether or not they were removed. Each inverse relationship is represented by a pair of relations.
- summary.html - the visual summary of the relation frequencies and entity degrees (without removed inverses).
- dataset.gexf - the entire dataset in the open graph format "gexf", which can be opened by applications such as Gephi.
Reference: Ayala, D., Borrego, A., Hernández, I., Rivero, C. R., & Ruiz, D. (2019, June). AYNEC: All You Need for Evaluating Completion Techniques in Knowledge Graphs. In European Semantic Web Conference (pp. 397-411). Springer, Cham.
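A minimal loading sketch for the train/test splits is shown below; the column separator is assumed to be a tab, which may need adjusting to the actual files.

```python
import pandas as pd

# The column order follows the description above; the tab separator is an assumption.
cols = ["source", "relation", "target", "label"]
train = pd.read_csv("train.txt", sep="\t", names=cols)
test = pd.read_csv("test.txt", sep="\t", names=cols)

print(train["label"].value_counts())  # one negative per positive, so expect a 1 / -1 balance
print(train.groupby("relation").size().sort_values(ascending=False).head())
```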
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.
The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.
The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The data comes in three versions: one cross-sectional dataset, and two cross-sectional time-series datasets for a selection of countries. The two combined datasets are called “long” (year 1946-2009) and “wide” (year 1970-2005).
The data contains six types of variables, each provided under its own heading in the codebook: Social policy variables, Tax system variables, Social Conditions, Public opinion data, Political indicators, Quality of government variables.
QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive Its variables are now included in QoG Standard.
Purpose:
The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The dataset combines cross-sectional data and time-series data for a selection of 40 countries. It is specifically tailored for the analysis of public opinion data over time: it instead uses country as its unit of observation, with one variable for every 5th year from 1970-2005 (or one per module of each public opinion data source).
Samanni, Marcus. Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg:The Quality of Government Institute. http://www.qog.pol.gu.se
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
arXiv: https://arxiv.org/abs/2312.09753
To construct the MORE dataset, we chose to use multimodal news data as a source rather than annotating existing MRE datasets primarily sourced from social media. Multimodal news data has selective and well-edited images and textual titles, resulting in relatively good data quality, and often contains timely and informative knowledge. We obtained the data from The New York Times English news and Yahoo News from 2019 to 2022, resulting in a candidate set of 15,000 multimodal news data instances covering various topics. We filtered out unqualified data and obtained a meticulously selected dataset for our research purposes. The candidate multimodal news was then annotated in three distinct stages.
Stage 1: Entity Identification and Object Detection. We utilized the AllenNLP named entity recognition tool and the YOLOv5 object detection tool to identify the entities in textual news titles and the object areas in the corresponding news images. All extracted objects and entities were reviewed and corrected manually by our annotators.
Stage 2: Object-Entity Relation Annotation. We recruited well-educated annotators to examine the textual titles and images and deduce the relations between the entities and objects. Relations were randomly assigned to annotators from the candidate set to ensure an unbiased annotation process. Data that did not clearly indicate any pre-defined relation was labeled as none. At least two annotators were required to independently review and annotate each item. In cases where there were discrepancies or conflicts in the annotations, a third annotator was consulted, and their decision was considered final. Weighted Cohen's Kappa was used to measure the consistency between different annotators.
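As a toy illustration of the agreement measure mentioned above, the sketch below computes Cohen's kappa between two hypothetical annotators with scikit-learn; the labels are invented, and for nominal relation labels the unweighted statistic is shown (the weighted variant applies when labels are ordinal).

```python
from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators' relation labels for the same five items (invented values,
# not taken from the MORE dataset).
annotator_1 = ["locatedIn", "none", "partOf", "locatedIn", "none"]
annotator_2 = ["locatedIn", "none", "none",   "locatedIn", "none"]

# cohen_kappa_score supports 'linear'/'quadratic' weights for ordinal labels; with
# nominal relation labels the unweighted statistic is shown here.
print(cohen_kappa_score(annotator_1, annotator_2))
```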
Stage 3: Object-Overlapped Data Filtering. To refine the scope of multimodal object-entity relation extraction task, we only focused on relations in which visual objects did not co-occur with any entities mentioned in the textual news titles. This process filtered down the data from 15,000 to over 3,000 articles containing more than 20,000 object-entity relational facts. This approach ensured a dataset of only relatable object-entity relationships illustrated in images, rather than those that were already mentioned explicitly in the textual news titles, resulting in a more focused dataset for the task.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The IDMT-SMT-GUITAR database is a large database for automatic guitar transcription. Seven different guitars in standard tuning were used with varying pick-up settings and different string measures to ensure sufficient diversification in the field of electric and acoustic guitars. The recording setup consisted of appropriate audio interfaces, which were directly connected to the guitar output or, in one case, to a condenser microphone. The recordings are provided in single-channel RIFF WAVE format with a 44100 Hz sample rate.
The dataset consists of four subsets. The first contains all introduced playing techniques (plucking styles: finger-style, muted, picked; expression styles: normal, bending, slide, vibrato, harmonics, dead-notes) and is provided with a bit depth of 24 bit. It was recorded using three different guitars and consists of about 4700 note events with monophonic and polyphonic structure. As a particularity, the recorded files contain realistic guitar licks ranging from monophonic to polyphonic instrument tracks.
The second subset of data consists of 400 monophonic and polyphonic note events each played with two different guitars. No expression styles were applied here and each note event was recorded and stored in a separate file with a bit depth of 16 Bit. The parameter annotations for the first and second subset are stored in XML format.
The third subset is made up of five short monophonic and polyphonic guitar recordings. All five pieces have been recorded with the same instrument and no special expression styles were applied. The files are stored with a bit depth of 16 Bit and each file is accompanied by a parameter annotation in XML format.
Additionally, a fourth subset is included, which was created for evaluation purposes in the context of chord recognition and rhythm style estimation tasks. This set contains recordings of 64 short musical pieces grouped by genre. Each piece has been recorded at two different tempi with three different guitars and is provided with a bit depth of 16 Bit. Annotations regarding onset positions, chords, rhythmic pattern length, and texture (monophony/polyphony) are included in various file formats.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Fitness Trends Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aroojanwarkhan/fitness-data-trends on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The motivation behind collecting this data set was personal, with the objective of answering a simple question: “does exercise/working out improve a person’s activeness?”. For the scope of this project, a person’s activeness was the measure of their daily step count (the number of steps they take in a day). Mood was measured as "Happy", "Neutral" or "Sad", which were given numeric values of 300, 200 and 100 respectively. Feeling of activeness was measured as "Active" or "Inactive", which were given numeric values of 500 and 0 respectively. I had noticed for a while that during the months when I was exercising regularly I felt more active and would move around a lot more; when I was not working out, I would feel lethargic. I wanted to know for sure what the connection between exercise and activeness was. I started compiling the data on 6th October with the help of the Samsung Health application, which was recording my daily step count and the number of calories burned. The purpose of the project was to establish through two sets of data (control and experimental) whether working out/exercise promotes an increase in the daily step count or not.
Columns: Date, Step Count, Calories Burned, Mood, Hours of Sleep, Feeling of Activeness or Inactiveness, Weight
Special thanks to Samsung Health that contributed to the set by providing daily step count and the number of calories burned.
"Does exercise/working-out improve a person’s activeness?”
--- Original source retains full ownership of the source dataset ---
This work was conducted by the Diverse Rotations Improve Valuable Ecosystem Services (DRIVES) project, based in the USDA-ARS Sustainable Agricultural Systems Lab in Beltsville, MD. The DRIVES team compiled a database of 20-plus long-term cropping systems experiments in North America in order to conduct cross-site research. This repository contains all scripts from our first research paper from the DRIVES database: "Rotational complexity increases cropping system output under poorer growing conditions," published in One Earth (in press). This analysis uses crop yield and experimental design data from the DRIVES database and public data sources for crop prices and inflation. This repository includes limited datasets derived from public sources or lacking connection to site IDs. We do not have permission to share the full primary dataset, but can provide data upon request with permission from site contacts. The scripts show all data setup, analysis, and visualization steps used to investigate how crop rotation diversity (defined by rotation length and the number of species) impacts productivity of whole rotations and component crops under varying growing conditions. We used Bayesian multilevel modeling fit to data from 20 long-term cropping systems datasets in North America (434 site-years, 36,000 observations). Rotation- and crop-level productivity were quantified as dollar output, using price coefficients derived from National Agriculture Statistics Service (NASS) price data (included in repository). Growing conditions were quantified using an Environmental Index calculated from site-year average output. Bayesian multilevel models were implemented using the 'brms' R package, which is a wrapper for Stan. Descriptions of all files are included in README.pdf.
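As a rough illustration (not the DRIVES scripts themselves), the sketch below converts crop yields to a dollar-output measure with per-crop price coefficients and computes a simple environmental index as the site-year mean output; all values and price coefficients are hypothetical.

```python
import pandas as pd

# Hypothetical yields and price coefficients; dollar output = yield x price per crop,
# and a simple environmental index is the site-year mean dollar output.
yields = pd.DataFrame({
    "site_year":   ["A-2019", "A-2019", "B-2019", "B-2019"],
    "crop":        ["maize",  "soybean", "maize",  "soybean"],
    "yield_mg_ha": [9.5, 3.1, 6.2, 2.4],
})
price_per_mg = {"maize": 165.0, "soybean": 370.0}  # hypothetical $ per Mg

yields["dollar_output"] = yields["yield_mg_ha"] * yields["crop"].map(price_per_mg)
env_index = yields.groupby("site_year")["dollar_output"].mean()
print(env_index)
```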