Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An open access graph dataset showing the connections between Dryad, CERN, ANDS, and other international data repositories and publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim of discovering and connecting related research datasets based on publication co-authorship or jointly funded grants.
The Everglades Vulnerability Analysis (EVA) is a series of connected Bayesian networks that models the landscape-scale response of indicators of Everglades ecosystem health to changes in hydrology and salinity on the landscape. Using the uncertainty built into each network, it also produces surfaces of vulnerability in relation to user-defined ‘ideal’ outcomes. This dataset includes the code used to build the modules and generate outputs of module outcome probabilities and landscape vulnerability.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool; Resources 2-8 are generated using the Flatterer utility.
Description of resources:
1. Dataset is a JSON Lines file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON (a short reading sketch follows this list).
2. Catalogue is an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
3. Datasets Metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
4. Resources Metadata contains the metadata for the resources contained within each dataset.
5. Resource Views Metadata contains the metadata for the views applied to each resource, if a resource has a view configured.
6. DataStore Fields Metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as a count of the number of records each table contains.
8. Data Package Entity Relation Diagram displays the title and format of each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
9. SQLite Database is a .db database, similar in structure to the Catalogue. It can be queried with database or analytical software tools.
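To give a sense of how Resource 1 can be consumed, here is a minimal Python sketch that reads the gzipped JSON Lines export line by line; the local filename is hypothetical, and the only field accessed ('title') is standard CKAN package metadata.

```python
import gzip
import json

# Hypothetical local filename for the gzipped JSON Lines catalogue export (Resource 1);
# adjust to the file you actually downloaded from the portal.
path = "od-do-canada.jsonl.gz"

titles = []
with gzip.open(path, "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)           # one Dataset/Open Information Record per line
        titles.append(record.get("title"))  # CKAN package metadata; heavily nested overall

print(f"{len(titles)} records read; first title: {titles[0]!r}")
```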
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of the mobile device, such as its supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various purposes, e.g., analyzing MAC address randomization, determining the number of people at a given location in a given time period, or analyzing trends in population movement (streets, shopping malls, etc.) across time periods.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which uses the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- supported data rates
- extended supported rates
- HT capabilities
- extended capabilities
- data under the extended tag and vendor specific tag
- interworking
- VHT capabilities
- RSSI
- SSID
- timestamp when the PR was received
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
PR_IE_data = {
    'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
    'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
        ...
    }
}
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_data.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure makes it possible to store only the time of arrival ('TIME') and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs seen from the same MAC address are also stored. The data of a newly detected PR is compared with the data already stored for the same MAC address in the current scan time interval. If identical IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical IE data from that MAC address has not yet been received, the PR_data structure of the new PR is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
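As a rough illustration of the structure described above, the following Python sketch walks one capture and counts probe requests per MAC address; the top-level layout of Single_PR_capture_example.json (scan metadata plus a list of per-MAC entries, here assumed under a 'DEVICES' key) is an assumption, so adjust the access paths to the actual file.

```python
import json

# Walk one scan-interval capture and count probe requests per MAC address.
# The per-MAC structure follows the description above; the top-level layout of the
# example file (scan metadata plus a list of per-MAC entries) is an assumption.
with open("Single_PR_capture_example.json", encoding="utf-8") as fh:
    capture = json.load(fh)

entries = capture if isinstance(capture, list) else capture.get("DEVICES", [capture])

for entry in entries:
    mac = entry["MAC"]
    ssids = entry.get("SSIDs", [])
    # Each element of PROBE_REQs groups PRs with identical IE data; 'TIME' and 'RSSI'
    # are parallel lists, so their length gives the number of PRs in that group.
    n_prs = sum(len(pr["TIME"]) for pr in entry["PROBE_REQs"])
    print(f"{mac}: {n_prs} probe requests, SSIDs seen: {ssids}")
```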
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, containing the samples recorded by that device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23rd of September 2022 00:00 local time until 24th of September 2022 00:00 local time.
Files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration, and the installation and data status of the entire collection system were checked in person. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day, which appears as up to a few minutes of missing data.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to an Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location is constant and undisturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment, and the internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France were interviewed between Jan-2016 and Aug-2016 in order to obtain exploitable data; this sample might therefore not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
Arbitrarily, we chose characteristics to describe Data Scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. As with all research, it might not be free of internal contradictions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical open data on LAU regions of Slovakia, the Czech Republic, Poland, and Hungary (and other countries in the future). LAU1 regions are called counties, okres, okresy, powiat, járás, járási, NUTS4, LAU, Local Administrative Units, ... and there are 733 of them in this V4 dataset. Overall, we cover 733 regions described by 137.828 observations (panel data rows) and more than 1.760.229 data points.
This LAU dataset contains panel data on population, on the age structure of inhabitants, and on the number and structure of registered unemployed. The dataset was prepared by Michal Páleník. Output files are in json, shapefile, xls, ods, topojson or CSV formats. Downloadable at zenodo.org.
This dataset consists of:
data on unemployment (by gender, education and duration of unemployment),
data on vacancies,
open data on population in Visegrad counties (by age and gender),
data on unemployment share.
Combined latest dataset
dataset of the latest available data on unemployment, vacancies and population
dataset includes map contours (shp, topojson or geojson format), relation id in OpenStreetMap, wikidata entry code,
it also includes NUTS4 code, LAU1 code used by national statistical office and abbreviation of the region (usually license plate),
source of map contours is OpenStreetMap, licensed under ODbL
no time series, only most recent data on population and unemployment combined in one output file
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies, pop_period, TOTAL, Y15-64, Y15-64-females, local_lau, osm_id, abbr, wikidata, population_density, area_square_km, way
Slovakia – SK: 79 LAU1 regions, data for 2024-10-01, 1.659 data,
Czech Republic – CZ: 77 LAU1 regions, data for 2024-10-01, 1.617 data,
Poland – PL: 380 LAU1 regions, data for 2024-09-01, 6.840 data,
Hungary – HU: 197 LAU1 regions, data for 2024-10-01, 2.955 data,
13.071 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 79 77 380 197
lau LAU code of the region 79 77 380 197
name name of the region in local language 79 77 380 197
registered_unemployed number of unemployed registered at labour offices 79 77 380 197
registered_unemployed_females number of unemployed women 79 77 380 197
disponible_unemployed unemployed able to accept job offer 79 77 0 0
low_educated unemployed without secondary school (ISCED 0 and 1) 79 77 380 197
long_term unemployed for longer than 1 year 79 77 380 0
unemployment_inflow inflow into unemployment 79 77 0 0
unemployment_outflow outflow from unemployment 79 77 0 0
below_25 number of unemployed below 25 years of age 79 77 380 197
over_55 unemployed older than 55 years 79 77 380 197
vacancies number of vacancies reported by labour offices 79 77 380 0
pop_period date of population data 79 77 380 197
TOTAL total population 79 77 380 197
Y15-64 number of people between 15 and 64 years of age, population in economically active age 79 77 380 197
Y15-64-females number of women between 15 and 64 years of age 79 77 380 197
local_lau region's code used by local labour offices 79 77 380 197
osm_id relation id in OpenStreetMap database 79 77 380 197
abbr abbreviation used for this region 79 77 380 0
wikidata wikidata identification code 79 77 380 197
population_density population density 79 77 380 197
area_square_km area of the region in square kilometres 79 77 380 197
way geometry, polygon of given region 79 77 380 197
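As an illustration of how the combined latest table can be used, the sketch below loads a CSV export and derives a simple unemployment share per region from the registered_unemployed and Y15-64 columns; the filename is hypothetical and the ratio is purely illustrative, not necessarily the definition used in the unemployment share data above.

```python
import pandas as pd

# Hypothetical filename for a CSV export of the combined latest table; column names
# follow the list above (registered_unemployed, Y15-64, lau, name, ...).
df = pd.read_csv("lau1_combined_latest.csv")

# A simple illustrative unemployment share: registered unemployed relative to the
# population aged 15-64, in percent.
df["unemployment_share"] = df["registered_unemployed"] / df["Y15-64"] * 100

top = df.sort_values("unemployment_share", ascending=False)
print(top[["lau", "name", "unemployment_share"]].head(10))
```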
Unemployment dataset
time series of unemployment data in Visegrad regions
by gender, duration of unemployment, education level, age groups, vacancies,
columns: period, lau, name, registered_unemployed, registered_unemployed_females, disponible_unemployed, low_educated, long_term, unemployment_inflow, unemployment_outflow, below_25, over_55, vacancies
Slovakia – SK: 79 LAU1 regions, data for 334 periods (1997-01-01 ... 2024-10-01), 202.082 data,
Czech Republic – CZ: 77 LAU1 regions, data for 244 periods (2004-07-01 ... 2024-10-01), 147.528 data,
Poland – PL: 380 LAU1 regions, data for 189 periods (2005-03-01 ... 2024-09-01), 314.100 data,
Hungary – HU: 197 LAU1 regions, data for 106 periods (2016-01-01 ... 2024-10-01), 104.408 data,
768.118 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 26 386 18 788 71 772 20 882
lau LAU code of the region 26 386 18 788 71 772 20 882
name name of the region in local language 26 386 18 788 71 772 20 882
registered_unemployed number of unemployed registered at labour offices 26 386 18 788 71 772 20 882
registered_unemployed_females number of unemployed women 26 386 18 788 62 676 20 882
disponible_unemployed unemployed able to accept job offer 25 438 18 788 0 0
low_educated unemployed without secondary school (ISCED 0 and 1) 11 771 9855 41 388 20 881
long_term unemployed for longer than 1 year 24 253 9855 41 388 0
unemployment_inflow inflow into unemployment 26 149 16 478 0 0
unemployment_outflow outflow from unemployment 26 149 16 478 0 0
below_25 number of unemployed below 25 years of age 11 929 9855 17 100 20 881
over_55 unemployed older than 55 years 11 929 9855 17 100 20 882
vacancies number of vacancies reported by labour offices 11 692 18 788 62 676 0
Population dataset
time series on population by gender and 5 year age groups in V4 counties
columns: period, lau, name, gender, TOTAL, Y00-04, Y05-09, Y10-14, Y15-19, Y20-24, Y25-29, Y30-34, Y35-39, Y40-44, Y45-49, Y50-54, Y55-59, Y60-64, Y65-69, Y70-74, Y75-79, Y80-84, Y85-89, Y90-94, Y_GE95, Y15-64
Slovakia – SK: 79 LAU1 regions, data for 28 periods (1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 152.628 data,
Czech Republic – CZ: 78 LAU1 regions, data for 24 periods (2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 125.862 data,
Poland – PL: 382 LAU1 regions, data for 29 periods (1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 626.941 data,
Hungary – HU: 197 LAU1 regions, data for 11 periods (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023), 86.680 data,
992.111 data in total.
column/number of observations description SK CZ PL HU
period period (month and year) the data is for 6636 5574 32 883 4334
lau LAU code of the region 6636 5574 32 883 4334
name name of the region in local language 6636 5574 32 883 4334
gender gender (male or female) 6636 5574 32 883 4334
TOTAL total population 6636 5574 32 503 4334
Y00-04 inhabitants between 00 to 04 years inclusive 6636 5574 32 503 4334
Y05-09 number of inhabitants between 05 to 09 years of age 6636 5574 32 503 4334
Y10-14 number of people between 10 to 14 years inclusive 6636 5574 32 503 4334
Y15-19 number of inhabitants between 15 to 19 years of age 6636 5574 32 503 4334
Y20-24 number of people between 20 to 24 years inclusive 6636 5574 32 503 4334
Y25-29 number of inhabitants between 25 to 29 years of age 6636 5574 32 503 4334
Y30-34 inhabitants between 30 to 34 years inclusive 6636 5574 32 503 4334
Y35-39 number of inhabitants between 35 to 39 years of age 6636 5574 32 503 4334
Y40-44 inhabitants between 40 to 44 years inclusive 6636 5574 32 503 4334
Y45-49 number of inhabitants between 45 to 49 years inclusive 6636 5574 32 503 4334
Y50-54 inhabitants between 50 to 54 years inclusive 6636 5574 32 503 4334
Y55-59 number of inhabitants between 55 to 59 years of age 6636 5574 32 503 4334
Y60-64 inhabitants between 60 to 64 years inclusive 6636 5574 32 503 4334
Y65-69 number of inhabitants between 65 to 69 years inclusive 6636 5574 32 503 4334
Y70-74 inhabitants between 70 to 74 years inclusive 6636 5574 24 670 4334
Y75-79 number of inhabitants between 75 to 79 years of age 6636 5574 24 670 4334
Y80-84 number of people between 80 to 84 years inclusive 6636 5574 24 670 4334
Y85-89 number of inhabitants between 85 to 89 years inclusive 6636 5574 0 0
Y90-94 inhabitants between 90 to 94 years inclusive 6636 5574 0 0
Y_GE95 number of people 95 years or older 6636 3234 0 0
Y15-64 number of people between 15 and 64 years of age, population in economically active age 6636 5574 32 503 4334
Notes
more examples at www.iz.sk
NUTS4 / LAU1 / LAU codes for HU and PL are created by me, so they can (and will) change in the future; CZ and SK NUTS4 codes are used by local statistical offices, so they should be more stable
NUTS4 codes are consistent with NUTS3 codes used by Eurostat
local_lau variable is an identifier used by local statistical office
abbr is abbreviation of region's name, used for map purposes (usually cars' license plate code; except for Hungary)
wikidata is code used by wikidata
osm_id is region's relation number in the OpenStreetMap database
Example outputs
you can download data in CSV, xml, ods, xlsx, shp, SQL, postgis, topojson, geojson or json format at 📥 doi:10.5281/zenodo.6165135
Counties of Slovakia – unemployment rate in Slovak LAU1 regions
Regions of the Slovak Republic
Unemployment of Czechia and Slovakia – unemployment share in LAU1 regions of Slovakia and Czechia
interactive map on unemployment in Slovakia
Slovakia – SK, Czech Republic – CZ, Hungary – HU, Poland – PL, NUTS3 regions of Slovakia
download at 📥 doi:10.5281/zenodo.6165135
suggested citation: Páleník, M. (2024). LAU1 dataset [Data set]. IZ Bratislava. https://doi.org/10.5281/zenodo.6165135
This entities dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'entities.json' file is to be used together with the 'documents.json' and 'relations.json' files (also found on this data.gov.uk webpage), and their structures and relationships are described on the given GitHub webpage.
From the Web site: The Post gained access to the Drug Enforcement Administration’s Automation of Reports and Consolidated Orders System, known as ARCOS, as the result of a court order. The Post and HD Media, which publishes the Charleston Gazette-Mail in West Virginia, waged a year-long legal battle for access to the database, which the government and the drug industry had sought to keep secret.
The version of the database published by The Post allows readers to learn how much hydrocodone and oxycodone went to individual states and counties, and which companies and distributors were responsible.
Also: Guidelines for using this data. Fill out the form below to establish a connection with our team and report any issues downloading the data. This will also allow us to update you with any additional information as it comes out and answer questions you may have. Because of the volume of requests, we ask that you use this channel rather than emailing our reporters individually. If you publish an online story, graphic, map or other piece of journalism based on this data set, please credit The Washington Post, link to the original source, and send us an email when you’ve hit publish. We want to learn what you discover and will attempt to link to your work as part of cataloguing the impact of this project. Post reporting and graphics can be used on-air. We ask for oral or on-screen credit to The Washington Post. For specific requests, including interviews with Post journalists, please email postpr@washpost.com.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Generation customers connected to UK Power Networks can be subject to curtailment through our Distributed Energy Resource Management System (DERMS) if they have accepted a curtailable connection. During periods of network congestion, these DERs will have their access reduced to mitigate network constraint breaches. Their reduction is organised according to their connection application date in a last-in first-out (LIFO) arrangement. The Constraints Real Time Meter Readings dataset on the Open Data Portal (ODP) gives a near real-time status of the constraints on our network that are used by DERMS to reduce access. This API-accessible dataset can be used to see just how congested the network is, and specific DER operators have access and visibility into the constraints affecting their own site. The dataset contains a timestamp, the constraint identifier, the most recent current reading in amps, the trim and release limits (curtailment starts at the trim and ends at the release), whether the site is in breach, a description of the constraint, and (only if you have access) the name of the DER. The dataset updates as close to real time as is possible. Our scheduling is as follows:
- At 15 s past the minute mark, we scrape the network data and push it to the ODP server.
- On the minute mark, the ODP runs an update to refresh the dataset.
- The dataset refresh is completed between 5-15 s past the minute mark.
- Only after this refresh has completed can you get the latest values from the ODP.
You can run this notebook to see the dataset in action: https://colab.research.google.com/drive/1Czx98U6zttlA3PC2OfI_0UzAbE48BvEq?usp=sharing
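For users who prefer a plain script over the notebook, the sketch below polls the portal just after the refresh window described above; the dataset id and the Opendatasoft Explore API path are assumptions, so check the dataset's API page on the portal for the exact values.

```python
import time
import requests

# Assumed dataset id and Opendatasoft Explore API path; check the dataset's API page
# on the portal for the actual values.
BASE = "https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/catalog/datasets"
DATASET_ID = "constraints-real-time-meter-readings"  # hypothetical id

def wait_until_second(sec: int = 20) -> None:
    """Sleep until `sec` seconds past the minute (the current one if still ahead, else the next)."""
    now = time.time()
    target = (now // 60) * 60 + sec
    if target <= now:
        target += 60
    time.sleep(target - now)

wait_until_second(20)  # the refresh completes 5-15 s past the minute, so 20 s is a safe margin
resp = requests.get(f"{BASE}/{DATASET_ID}/records", params={"limit": 10}, timeout=30)
resp.raise_for_status()
for record in resp.json().get("results", []):
    print(record)
```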
Methodological Approach
A Remote Terminal Unit (RTU) is installed at each curtailable-connection site providing live telemetry data into the DERMS. It measures communications status, generator output, and mode of operation. RTUs are also installed at constraint locations (physical parts of the network, e.g., transformers, cables which may become overloaded under certain conditions). These are identified through planning power load studies. These RTUs monitor current at the constraint and communications status. The DERMS design integrates network topology information. This maps constraints to associated curtailable connections under different network running conditions, including the sensitivity of the constraints to each curtailable connection. In general, a 1MW reduction in generation of a customer will cause <1MW reduction at the constraint. Each constraint is registered to a GSP. DERMS monitors constraints against the associated breach limit. When a constraint limit is breached, DERMS calculates the amount of access reduction required from curtailable connections linked to the constraint to alleviate the breach. This calculation factors in the real-time level of generation of each customer and the sensitivity of the constraint to each generator. Access reduction is issued to each curtailable-connection via the RTU until the constraint limit breach is mitigated. Multiple constraints can apply to a curtailable-connection and constraint breaches can occur simultaneously. Where multiple constraint breaches act upon a single curtailable-connection, we apportion the access reduction of that connection to the constraint breaches depending on the relative magnitude of the breaches. Where customer curtailment occurs without any associated constraint breach, we categorize the curtailment as non-constraint driven. Future developments will include the reason for non-constraint driven curtailment.
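The apportionment step described above can be illustrated with a toy calculation: a single curtailable connection affected by two simultaneous constraint breaches has its access reduction split in proportion to the breach magnitudes. The numbers are illustrative only and do not reflect the actual DERMS implementation.

```python
# Toy illustration of the apportionment described above: a single curtailable connection
# affected by two simultaneous constraint breaches has its access reduction split in
# proportion to the breach magnitudes. All numbers are illustrative only.
breaches = {"constraint_A": 12.0, "constraint_B": 4.0}  # breach magnitudes (amps over limit)
total_reduction_mw = 2.0                                # access reduction applied to this DER

total_breach = sum(breaches.values())
apportioned = {name: total_reduction_mw * magnitude / total_breach
               for name, magnitude in breaches.items()}
print(apportioned)  # {'constraint_A': 1.5, 'constraint_B': 0.5}
```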
Quality Control Statement
Quality Control Measures include:
Manual review and correction of data inconsistencies. Use of additional verification steps to ensure accuracy in the methodology.
Assurance Statement
The DSO Data Science Team has checked the data to ensure accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The input-output table is comprehensive and detailed in describing the national economic system with complex economic relationships, which embodies information of supply and demand among industrial sectors. This paper aims to scale the degree of competition/collaboration on the global value chain from the perspective of econophysics. Global Industrial Strongest Relevant Network models were established by extracting the strongest and most immediate industrial relevance in the global economic system with inter-country input-output tables and then transformed into Global Industrial Resource Competition Network/Global Industrial Production Collaboration Network models embodying the competitive/collaborative relationships based on bibliographic coupling/co-citation approach. Three indicators well suited for these two kinds of weighted and non-directed networks with self-loops were introduced, including unit weight for competitive/collaborative power, disparity in the weight for competitive/collaborative amplitude and weighted clustering coefficient for competitive/collaborative intensity. Finally, these models and indicators were further applied to empirically analyze the function of sectors in the latest World Input-Output Database, to reveal inter-sector competitive/collaborative status during the economic globalization.
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Datasets Description:
The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.
Classification and Regression Tasks: One can interpret these datasets as being suitable for both classification and regression analyses. The classes are ordered, albeit imbalanced. For instance, the dataset contains a more significant number of normal wines compared to excellent or poor ones.
Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include: 1. Fixed acidity 2. Volatile acidity 3. Citric acid 4. Residual sugar 5. Chlorides 6. Free sulfur dioxide 7. Total sulfur dioxide 8. Density 9. pH 10. Sulphates 11. Alcohol
The output variable, based on sensory data, is denoted by: 12. Quality (score ranging from 0 to 10)
Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
Operational Workflow: To utilize the dataset efficiently, the following steps are recommended:
1. Connect a File Reader node (for the CSV) to a Linear Correlation node and an Interactive Histogram node for basic Exploratory Data Analysis (EDA).
2. Connect the File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest'.
3. Connect the Rule Engine node output to the input of a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
4. Connect the Column Filter node output to the input of a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
5. Feed the Partitioning node's training split output into the input of a Decision Tree Learner node.
6. Connect the Partitioning node's test split output to the data input of a Decision Tree Predictor node.
7. Link the Decision Tree Learner node's model output to the model input of the Decision Tree Predictor node.
8. Finally, connect the Decision Tree Predictor output to the input of an ROC node for model evaluation based on the AUC value.
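The workflow above is written for KNIME; for readers working in Python, a rough scikit-learn equivalent of steps 2-8 is sketched below, assuming the semicolon-separated UCI winequality-red.csv file is available locally.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# The UCI "Vinho Verde" CSV files are semicolon-separated; the local filename is assumed.
df = pd.read_csv("winequality-red.csv", sep=";")

# Threshold the 0-10 quality score into a binary target (>= 7 -> good), then drop the
# original column to avoid leakage, mirroring steps 2-3 of the workflow above.
y = (df["quality"] >= 7).astype(int)
X = df.drop(columns="quality")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")
```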
Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Atticus Open Contract Dataset (AOK) (beta) is a corpus of 5,000+ labels in 200 commercial legal contracts that have been manually labeled by legal experts to identify 40 types of clauses that are important during contract review in connection with corporate transactions, such as mergers and acquisitions, IPOs, and corporate financing. The AOK Dataset is curated and maintained by The Atticus Project, Inc., a non-profit organization, to support NLP research and development in legal contract review. If you download this dataset, we'd love to know more about you and your project! Please fill out this short form: https://forms.gle/h47GUENTTbBqH39m7
Check out our website at atticusprojectai.org.
Update: The expanded 1.0 version of the dataset is available here https://zenodo.org/record/4595826
This packaged data collection contains two sets of two additional model runs that used the same inputs and parameters as our primary model, the exception being that we implemented a "maximum corridor length" constraint that allowed us to identify and visualize corridors as being well connected (≤15 km) or moderately connected (≤45 km). This is based on the assumption that corridors longer than 45 km are too long to sufficiently accommodate dispersal. One of these sets is based on a maximum corridor length that uses Euclidean (straight-line) distance, while the other set is based on a maximum corridor length that uses cost-weighted distance. These two sets of corridors can be compared against the full set of corridors from our primary model to identify the remaining corridors, which could be considered poorly connected. This package includes the following data layers:
- Corridors classified as well connected (≤15 km) based on cost-weighted distance
- Corridors classified as moderately connected (≤45 km) based on cost-weighted distance
- Corridors classified as well connected (≤15 km) based on Euclidean distance
- Corridors classified as moderately connected (≤45 km) based on Euclidean distance
Please refer to the embedded metadata and the information in our full report for details on the development of these data layers. Packaged data are available in two formats:
- Geodatabase (.gdb): a related set of file geodatabase rasters and feature classes, packaged in an ESRI file geodatabase.
- ArcGIS Pro Map Package (.mpkx): the same data included in the geodatabase, presented as fully symbolized layers in a map. Note that you must have ArcGIS Pro version 2.0 or greater to view.
See Cross-References for links to individual datasets, which can be downloaded in raster GeoTIFF (.tif) format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Article Information
The work involved in developing the dataset and benchmarking its use for machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.
Please do cite the aforementioned article when using this dataset.
Abstract
The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a Machine Learning Model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised the Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and provided it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.
ZIP Folder Content
The ZIP folder comprises two main components: Captures and Datasets. Within the captures folder, we have included all the captures used in this project. These captures are organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. Similarly, the datasets folder follows a similar organizational approach. It contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.
To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.
This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
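A minimal sketch of loading one of the sub-datasets in either format is shown below; the folder and file names are hypothetical placeholders, so substitute the actual paths found inside the ZIP.

```python
import pandas as pd

# The folder layout and file names are placeholders; substitute the actual paths
# found inside the ZIP (BLE, IP-Based Packet, IP-Based Flows).
df_csv = pd.read_csv("Datasets/BLE/ble_dataset.csv")        # hypothetical filename
df_pkl = pd.read_pickle("Datasets/BLE/ble_dataset.pickle")  # same data, pickle format

print(df_csv.shape, df_pkl.shape)
print(df_csv.columns.tolist()[:10])
```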
Datasets' Content
Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.
Identified Key Features Within Bluetooth Dataset
Feature Meaning
btle.advertising_header BLE Advertising Packet Header
btle.advertising_header.ch_sel BLE Advertising Channel Selection Algorithm
btle.advertising_header.length BLE Advertising Length
btle.advertising_header.pdu_type BLE Advertising PDU Type
btle.advertising_header.randomized_rx BLE Advertising Rx Address
btle.advertising_header.randomized_tx BLE Advertising Tx Address
btle.advertising_header.rfu.1 Reserved For Future 1
btle.advertising_header.rfu.2 Reserved For Future 2
btle.advertising_header.rfu.3 Reserved For Future 3
btle.advertising_header.rfu.4 Reserved For Future 4
btle.control.instant Instant Value Within a BLE Control Packet
btle.crc.incorrect Incorrect CRC
btle.extended_advertising Advertiser Data Information
btle.extended_advertising.did Advertiser Data Identifier
btle.extended_advertising.sid Advertiser Set Identifier
btle.length BLE Length
frame.cap_len Frame Length Stored Into the Capture File
frame.interface_id Interface ID
frame.len Frame Length Wire
nordic_ble.board_id Board ID
nordic_ble.channel Channel Index
nordic_ble.crcok Indicates if CRC is Correct
nordic_ble.flags Flags
nordic_ble.packet_counter Packet Counter
nordic_ble.packet_time Packet time (start to end)
nordic_ble.phy PHY
nordic_ble.protover Protocol Version
Identified Key Features Within IP-Based Packets Dataset
Feature Meaning
http.content_length Length of content in an HTTP response
http.request HTTP request being made
http.response.code HTTP response status code
http.response_number Sequential number of an HTTP response
http.time Time taken for an HTTP transaction
tcp.analysis.initial_rtt Initial round-trip time for TCP connection
tcp.connection.fin TCP connection termination with a FIN flag
tcp.connection.syn TCP connection initiation with SYN flag
tcp.connection.synack TCP connection establishment with SYN-ACK flags
tcp.flags.cwr Congestion Window Reduced flag in TCP
tcp.flags.ecn Explicit Congestion Notification flag in TCP
tcp.flags.fin FIN flag in TCP
tcp.flags.ns Nonce Sum flag in TCP
tcp.flags.res Reserved flags in TCP
tcp.flags.syn SYN flag in TCP
tcp.flags.urg Urgent flag in TCP
tcp.urgent_pointer Pointer to urgent data in TCP
ip.frag_offset Fragment offset in IP packets
eth.dst.ig Ethernet destination is in the internal network group
eth.src.ig Ethernet source is in the internal network group
eth.src.lg Ethernet source is in the local network group
eth.src_not_group Ethernet source is not in any network group
arp.isannouncement Indicates if an ARP message is an announcement
Identified Key Features Within IP-Based Flows Dataset
Feature Meaning
proto Transport layer protocol of the connection
service Identification of an application protocol
orig_bytes Originator payload bytes
resp_bytes Responder payload bytes
history Connection state history
orig_pkts Originator sent packets
resp_pkts Responder sent packets
flow_duration Length of the flow in seconds
fwd_pkts_tot Forward packets total
bwd_pkts_tot Backward packets total
fwd_data_pkts_tot Forward data packets total
bwd_data_pkts_tot Backward data packets total
fwd_pkts_per_sec Forward packets per second
bwd_pkts_per_sec Backward packets per second
flow_pkts_per_sec Flow packets per second
fwd_header_size Forward header bytes
bwd_header_size Backward header bytes
fwd_pkts_payload Forward payload bytes
bwd_pkts_payload Backward payload bytes
flow_pkts_payload Flow payload bytes
fwd_iat Forward inter-arrival time
bwd_iat Backward inter-arrival time
flow_iat Flow inter-arrival time
active Flow active duration
These datasets are presented in the article "AYNEC: All You Need for Evaluating Completion Techniques in Knowledge Graphs", submitted to ESWC 2019. Please cite it in your work if you make use of them. The following datasets are included:
- WN18-AF, generated from WN18.
- WN18-AR, generated from WN18, removing inverses.
- WN11-AF, generated from WN11.
- WN11-AR, generated from WN11, removing inverses.
- FB13-A, generated from FB13.
- FB15K-AF, generated from FB15K.
- FB15K-AR, generated from FB15K, keeping relations that cover 95% of the graph and removing inverses.
- NELL-AF, generated from NELL.
- NELL-AR, generated from NELL, keeping relations that cover 95% of the graph and removing inverses.
In all datasets, we removed relations with only one instance, used 20% of each relation in the graph for testing, and generated one negative for each positive in both training and testing by replacing the target of the positive with a random entity. In WN11 and WN18 all entities are potential candidates. In the rest of the datasets, only entities that have appeared as targets of the relation are candidates. Two relations were considered inverses when there was a 90% overlap between them; that is, relations A and B are inverses if for 90% of the instances of A there is an instance of B with the source and target reversed, and vice versa. When removing inverses, the smaller of each pair of inverses was removed. Each zip file contains the following files about a dataset:
- train.txt - triples used for training. Each line contains the source, the relation, the target, and the label (1 for positives and -1 for negatives); a loading sketch follows this list.
- test.txt - triples used for testing, following the same format.
- relations.txt - a list of the relations in the dataset, each with its frequency.
- entities.txt - a list of the entities in the dataset, each with its total degree, inwards degree, and outwards degree.
- inverses.txt - a list of the inverses in the original graph, whether or not they were removed. Each inverse relationship is represented by a pair of relations.
- summary.html - the visual summary of the relation frequencies and entity degrees (without removed inverses).
- dataset.gexf - the entire dataset in the open graph format "gexf", which can be opened by applications such as Gephi.
Reference: Ayala, D., Borrego, A., Hernández, I., Rivero, C. R., & Ruiz, D. (2019, June). AYNEC: All You Need for Evaluating Completion Techniques in Knowledge Graphs. In European Semantic Web Conference (pp. 397-411). Springer, Cham.
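A minimal loading sketch for the train/test splits is shown below; the column separator is assumed to be a tab, which may need adjusting to the actual files.

```python
import pandas as pd

# The column order follows the description above; the tab separator is an assumption.
cols = ["source", "relation", "target", "label"]
train = pd.read_csv("train.txt", sep="\t", names=cols)
test = pd.read_csv("test.txt", sep="\t", names=cols)

print(train["label"].value_counts())  # one negative per positive, so expect a 1 / -1 balance
print(train.groupby("relation").size().sort_values(ascending=False).head())
```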
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. Overall 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.
The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.
The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The data comes in three versions: one cross-sectional dataset, and two cross-sectional time-series datasets for a selection of countries. The two combined datasets are called “long” (year 1946-2009) and “wide” (year 1970-2005).
The data contains six types of variables, each provided under its own heading in the codebook: Social policy variables, Tax system variables, Social Conditions, Public opinion data, Political indicators, Quality of government variables.
QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive Its variables are now included in QoG Standard.
Purpose:
The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The dataset combines cross-sectional data and time-series data for a selection of 40 countries. It is specifically tailored for the analysis of public opinion data over time: it instead uses country as its unit of observation, with one variable for every 5th year from 1970-2005 (or one per module of each public opinion data source).
Samanni, Marcus. Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg:The Quality of Government Institute. http://www.qog.pol.gu.se
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
arXiv: https://arxiv.org/abs/2312.09753
To construct the MORE dataset, we chose to use multimodal news data as a source rather than annotating existing MRE datasets primarily sourced from social media. Multimodal news data has selective and well-edited images and textual titles, resulting in relatively good data quality, and often contains timely and informative knowledge. We obtained the data from The New York Times English news and Yahoo News from 2019 to 2022, resulting in a candidate set of 15,000 multimodal news data instances covering various topics. We filtered out unqualified data and obtained a meticulously selected dataset for our research purposes. The candidate multimodal news was then annotated in three distinct stages.
Stage 1: Entity Identification and Object Detection. We utilized the AllenNLP named entity recognition tool and the YOLOv5 object detection tool to identify the entities in textual news titles and the object areas in the corresponding news images. All extracted objects and entities were reviewed and corrected manually by our annotators.
Stage 2: Object-Entity Relation Annotation. We recruited well-educated annotators to examine the textual titles and images and deduce the relations between the entities and objects. Relations were randomly assigned to annotators from the candidate set to ensure an unbiased annotation process. Data that did not clearly indicate any pre-defined relation was labeled as none. At least two annotators were required to independently review and annotate each item. In cases where there were discrepancies or conflicts in the annotations, a third annotator was consulted, and their decision was considered final. Weighted Cohen's Kappa was used to measure the consistency between different annotators.
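As a toy illustration of the agreement measure mentioned above, the sketch below computes Cohen's kappa between two hypothetical annotators with scikit-learn; the labels are invented, and for nominal relation labels the unweighted statistic is shown (the weighted variant applies when labels are ordinal).

```python
from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators' relation labels for the same five items (invented values,
# not taken from the MORE dataset).
annotator_1 = ["locatedIn", "none", "partOf", "locatedIn", "none"]
annotator_2 = ["locatedIn", "none", "none",   "locatedIn", "none"]

# cohen_kappa_score supports 'linear'/'quadratic' weights for ordinal labels; with
# nominal relation labels the unweighted statistic is shown here.
print(cohen_kappa_score(annotator_1, annotator_2))
```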
Stage 3: Object-Overlapped Data Filtering. To refine the scope of multimodal object-entity relation extraction task, we only focused on relations in which visual objects did not co-occur with any entities mentioned in the textual news titles. This process filtered down the data from 15,000 to over 3,000 articles containing more than 20,000 object-entity relational facts. This approach ensured a dataset of only relatable object-entity relationships illustrated in images, rather than those that were already mentioned explicitly in the textual news titles, resulting in a more focused dataset for the task.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The IDMT-SMT-GUITAR database is a large database for automatic guitar transcription. Seven different guitars in standard tuning were used with varying pick-up settings and different string measures to ensure sufficient diversification in the field of electric and acoustic guitars. The recording setup consisted of appropriate audio interfaces, which were directly connected to the guitar output or, in one case, to a condenser microphone. The recordings are provided in single-channel RIFF WAVE format with a 44100 Hz sample rate.
The dataset consists of four subsets. The first contains all introduced playing techniques (plucking styles: finger-style, muted, picked; expression styles: normal, bending, slide, vibrato, harmonics, dead-notes) and is provided with a bit depth of 24 bit. It was recorded using three different guitars and consists of about 4700 note events with monophonic and polyphonic structure. As a particularity, the recorded files contain realistic guitar licks ranging from monophonic to polyphonic instrument tracks.
The second subset of data consists of 400 monophonic and polyphonic note events each played with two different guitars. No expression styles were applied here and each note event was recorded and stored in a separate file with a bit depth of 16 Bit. The parameter annotations for the first and second subset are stored in XML format.
The third subset is made up of five short monophonic and polyphonic guitar recordings. All five pieces have been recorded with the same instrument and no special expression styles were applied. The files are stored with a bit depth of 16 Bit and each file is accompanied by a parameter annotation in XML format.
Additionally, a fourth subset is included, which was created for evaluation purposes in the context of chord recognition and rhythm style estimation tasks. This set contains recordings of 64 short musical pieces grouped by genre. Each piece has been recorded at two different tempi with three different guitars and is provided with a bit depth of 16 Bit. Annotations regarding onset positions, chords, rhythmic pattern length, and texture (monophony/polyphony) are included in various file formats.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Fitness Trends Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aroojanwarkhan/fitness-data-trends on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The motivation behind collecting this data set was personal, with the objective of answering a simple question: “does exercise/working out improve a person’s activeness?”. For the scope of this project, a person’s activeness was the measure of their daily step count (the number of steps they take in a day). Mood was measured as "Happy", "Neutral" or "Sad", which were given numeric values of 300, 200 and 100 respectively. Feeling of activeness was measured as "Active" or "Inactive", which were given numeric values of 500 and 0 respectively. I had noticed for a while that during the months when I was exercising regularly I felt more active and would move around a lot more; when I was not working out, I would feel lethargic. I wanted to know for sure what the connection between exercise and activeness was. I started compiling the data on 6th October with the help of the Samsung Health application, which was recording my daily step count and the number of calories burned. The purpose of the project was to establish through two sets of data (control and experimental) whether working out/exercise promotes an increase in the daily step count or not.
Columns: Date, Step Count, Calories Burned, Mood, Hours of Sleep, Feeling of Activeness or Inactiveness, Weight
Special thanks to Samsung Health that contributed to the set by providing daily step count and the number of calories burned.
"Does exercise/working-out improve a person’s activeness?”
--- Original source retains full ownership of the source dataset ---
This work was conducted by the Diverse Rotations Improve Valuable Ecosystem Services (DRIVES) project, based in the USDA-ARS Sustainable Agricultural Systems Lab in Beltsville, MD. The DRIVES team compiled a database of 20-plus long-term cropping systems experiments in North America in order to conduct cross-site research. This repository contains all scripts from our first research paper from the DRIVES database: "Rotational complexity increases cropping system output under poorer growing conditions," published in One Earth (in press). This analysis uses crop yield and experimental design data from the DRIVES database and public data sources for crop prices and inflation. This repository includes limited datasets derived from public sources or lacking connection to site IDs. We do not have permission to share the full primary dataset, but can provide data upon request with permission from site contacts. The scripts show all data setup, analysis, and visualization steps used to investigate how crop rotation diversity (defined by rotation length and the number of species) impacts productivity of whole rotations and component crops under varying growing conditions. We used Bayesian multilevel modeling fit to data from 20 long-term cropping systems datasets in North America (434 site-years, 36,000 observations). Rotation- and crop-level productivity were quantified as dollar output, using price coefficients derived from National Agriculture Statistics Service (NASS) price data (included in repository). Growing conditions were quantified using an Environmental Index calculated from site-year average output. Bayesian multilevel models were implemented using the 'brms' R package, which is a wrapper for Stan. Descriptions of all files are included in README.pdf.
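As a rough illustration (not the DRIVES scripts themselves), the sketch below converts crop yields to a dollar-output measure with per-crop price coefficients and computes a simple environmental index as the site-year mean output; all values and price coefficients are hypothetical.

```python
import pandas as pd

# Hypothetical yields and price coefficients; dollar output = yield x price per crop,
# and a simple environmental index is the site-year mean dollar output.
yields = pd.DataFrame({
    "site_year":   ["A-2019", "A-2019", "B-2019", "B-2019"],
    "crop":        ["maize",  "soybean", "maize",  "soybean"],
    "yield_mg_ha": [9.5, 3.1, 6.2, 2.4],
})
price_per_mg = {"maize": 165.0, "soybean": 370.0}  # hypothetical $ per Mg

yields["dollar_output"] = yields["yield_mg_ha"] * yields["crop"].map(price_per_mg)
env_index = yields.groupby("site_year")["dollar_output"].mean()
print(env_index)
```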