Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
#Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].
The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.
This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.
For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.
For questions about this dataset, contact andy.hoke@nrel.gov.
If you find this dataset useful, please mention NREL and cite [1] in your work.
References:
[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .
[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf
[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.
The batches.meta file contains the label names of each class.
The dataset was originally divided in 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file so it is easier to load.
Here is the list of the 10 classes in the CIFAR-10:
Classes: 1) 0: airplane 2) 1: automobile 3) 2: bird 4) 3: cat 5) 4: deer 6) 5: dog 7) 6: frog 8) 7: horse 9) 8: ship 10) 9: truck
The function used to open the file:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Example of how to read the file:
metadata_path = './cifar-10-python/batches.meta' # change this path
metadata = unpickle(metadata_path)
Dataset Card for Python-DPO
This dataset is the smaller version of Python-DPO-Large dataset and has been created using Argilla.
Load with datasets
To load this dataset with datasets, you'll just need to install datasets as pip install datasets --upgrade and then use the following code: from datasets import load_dataset
ds = load_dataset("NextWealth/Python-DPO")
Data Fields
Each data instance contains:
instruction: The problem description/requirements… See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data inputted in the simulation were generated by two Python scripts: "GENERATE_SAMPLES.py" and "GENERATE_RESAMPLING_DATA.py".1. "GENERATE_SAMPLES.py": In this Python script, we aim to generate a) "DataSet_n[N]_p[p].pickle" where N is replaced by 500 or 5000, p is replaced by 2 or 10. This Python object contains: a1. the explicative variables "X", a2. the responses "Y", a3. the knots "knots", a4. the target tail index parameters "gamma0", a5. the k-different ranndom state responses "Yk" with k=1,..,100. To read these data, you should run the following python code (take n=5000 and p=10 for example) import pickle with open('DataSet_n5000_p10.pickle', 'rb') as handle: X = pickle.load(handle) Y = pickle.load(handle) knots = pickle.load(handle) gamma0 = pickle.load(handle) Yk = pickle.load(handle) b) "gridX_p[p].picke" where p is replaced by 2 or 10. This Python object contains: b1. the setting points "gridX" which correspond to (x(1)_(m1),...,x(p)_(mp)) in the paper, b2. "prefactor" corresponds to \Delta(p)x in the paper b3. "gamma0_gridX corresponds to gamma0(gridX) To read these data, you should run the following python code (take p=10 for example) import pickle with open('gridX_p10.pickle', 'rb') as handle: gridX = pickle.load(handle) prefactor = pickle.load(handle) gamma0_gridX = pickle.load(handle)2. "GENERATE_RESAMPLING_DATA.py": In this Python script, we aim to generate: a) "DataSet_Resampling_n[N]_p[p]_w_replacement.pickle" where N is replaced by 500 or 5000, p is replaced by 2 or 10. This Python object contains: a1. the resampling explicative variables "X_resample", a2. the knots "knots", a3. the resampling k-different random state response "Y_resample". To read these data, you should run the following python code (take N=5000 and p=10 for example) import pickle with open('DataSet_Resampling_n5000_p10_w_replacement.pickle', 'rb') as handle: X_resample = pickle.load(handle) ignored = pickle.load(handle) Y_resample = pickle.load(handle)
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py
The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph style schema. The Open Context Python application interacts with the PostgreSQL database via the Django Object-Relational-Model (ORM).
This database dump includes all published structured data organized used by Open Context (table names that start with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long term digital repositories.
These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the Website are not included.
IMPORTANT
This database dump contains data from roughly 190+ different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated agains real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which it itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amount to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent a same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analyis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is Dataframe with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists in two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
3D skeletons UP-Fall Dataset
Different between Fall and Impact detection
Overview
This dataset aims to facilitate research in fall detection, particularly focusing on the precise detection of impact moments within fall events. The 3D skeletons data accuracy and comprehensiveness make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios
Data Collection
The skeletal data was extracted using a pose estimation algorithm, which processes images frames to determine the 3D coordinates of each joint. Sequences with less than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.
CSV Structure
The data is organized by subjects, and each subject contains CSV files named according to the pattern C1S1A1T1, where:
C: Camera (1 or 2)
S: Subject (1 to 5)
A: Activity (1 to N, representing different activities)
T: Trial (1 to 3)
subject1/`: Contains CSV files for Subject 1.
C1S1A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 1
C1S1A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 1
C1S1A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 1
C2S1A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 1
C2S1A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 1
C2S1A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 1
subject2/`: Contains CSV files for Subject 2.
C1S2A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 2
C1S2A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 2
C1S2A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 2
C2S2A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 2
C2S2A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 2
C2S2A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 2
subject3/, subject4/, subject5/: Similar structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.
Column Descriptions
Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:
Column Name
Description
joint_1_x
X coordinate of joint 1
joint_1_y
Y coordinate of joint 1
joint_1_z
Z coordinate of joint 1
joint_2_x
X coordinate of joint 2
joint_2_y
Y coordinate of joint 2
joint_2_z
Z coordinate of joint 2
...
...
joint_n_x
X coordinate of joint n
joint_n_y
Y coordinate of joint n
joint_n_z
Z coordinate of joint n
LABEL
Label indicating impact (1) or non-impact (0)
Example
Here is an example of what a row in one of the CSV files might look like:
joint_1_x
joint_1_y
joint_1_z
joint_2_x
joint_2_y
joint_2_z
...
joint_n_x
joint_n_y
joint_n_33
LABEL
0.123
0.456
0.789
0.234
0.567
0.890
...
0.345
0.678
0.901
0
Usage
This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.
Using github
Clone the repository:
-bash git clone
https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset
Navigate to the directory:
-bash -cd 3D_skeletons-UP-Fall-Dataset
Examples
Here's a simple example of how to load and inspect a sample data file using Python:```pythonimport pandas as pd
data = pd.read_csv('subject1/C1S1A1T1.csv')print(data.head())
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The speed, direction of the wind and the variable wind indicator are the variables recorded by the meteorological network of the Chilean Meteorological Directorate (DMC). This collection contains the information stored by 326 stations that have recorded, at some point, the orientation of the wind since 1950, spaced one hour apart. It is important to note that not all stations are currently operational.
The data is updated directly from the DMC's web services and can be viewed in the Data Series viewer of the Itrend Data Platform.
In addition, a historical database is provided in .npz* and .mat** format that is updated every 30 days for those stations that are still valid.
*To load the data correctly in Python it is recommended to use the following code:
import numpy as np
with np.load(filename, allow_pickle = True) as f:
data = {}
for key, value in f.items():
data[key] = value.item()
**Date data is in datenum
format, and to load it correctly in datetime
format, it is recommended to use the following command in MATLAB:
datetime(TS.x , 'ConvertFrom' , 'datenum')
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains all the data tables and Python scripts necessary to generate the results presented in Le Minor et al. (2022): "Multi Grain-Size Total Sediment Load Model Based on the Disequilibrium Length".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used in the various stage two experiments in: "Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance". This includes datasets with varied characteristics.All datasets are stored in a dict with tuples of (time series array, class labels). To access data in python:import picklefilename = "dataset.txt"with open(filename, 'rb') as f: data = pickle.load(f)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used in the stage one 1NN classification experiment in: "Comparing Clustering Approaches for Smart Meter Time Series: Investigating the Influence of Dataset Properties on Performance"All datasets are stored in a dict with tuples of (time series array, class labels). To access data in python:import picklefilename = "dataset.txt"with open(filename, 'rb') as f: data = pickle.load(f)
CIFAR-10 Python (in CSV): LINK
The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). There are 50000 training images and 10000 test images. The meta file contains the label names of each class and superclass.
Here is the list of the 100 classes in the CIFAR-100:
Classes: 1-5) beaver, dolphin, otter, seal, whale 6-10) aquarium fish, flatfish, ray, shark, trout 11-15) orchids, poppies, roses, sunflowers, tulips 16-20) bottles, bowls, cans, cups, plates 21-25) apples, mushrooms, oranges, pears, sweet peppers 26-30) clock, computer keyboard, lamp, telephone, television 31-35) bed, chair, couch, table, wardrobe 36-40) bee, beetle, butterfly, caterpillar, cockroach 41-45) bear, leopard, lion, tiger, wolf 46-50) bridge, castle, house, road, skyscraper 51-55) cloud, forest, mountain, plain, sea 56-60) camel, cattle, chimpanzee, elephant, kangaroo 61-65) fox, porcupine, possum, raccoon, skunk 66-70) crab, lobster, snail, spider, worm 71-75) baby, boy, girl, man, woman 76-80) crocodile, dinosaur, lizard, snake, turtle 81-85) hamster, mouse, rabbit, shrew, squirrel 86-90) maple, oak, palm, pine, willow 91-95) bicycle, bus, motorcycle, pickup truck, train 96-100) lawn-mower, rocket, streetcar, tank, tractor
and the list of the 20 superclasses: 1) aquatic mammals (classes 1-5) 2) fish (classes 6-10) 3) flowers (classes 11-15) 4) food containers (classes 16-20) 5) fruit and vegetables (classes 21-25) 6) household electrical devices (classes 26-30) 7) household furniture (classes 31-35) 8) insects (classes 36-40) 9) large carnivores (classes 41-45) 10) large man-made outdoor things (classes 46-50) 11) large natural outdoor scenes (classes 51-55) 12) large omnivores and herbivores (classes 56-60) 13) medium-sized mammals (classes 61-65) 14) non-insect invertebrates (classes 66-70) 15) people (classes 71-75) 16) reptiles (classes 76-80) 17) small mammals (classes 81-85) 18) trees (classes 86-90) 19) vehicles 1 (classes 91-95) 20) vehicles 2 (classes 96-100)
The function used to open each file:
def unpickle(file):
import pickle
with open(file, 'rb') as fo:
dict = pickle.load(fo, encoding='bytes')
return dict
Example of how to read the metadata and the superclasses:
metadata_path = './cifar-100-python/meta' # change this path`\
metadata = unpickle(metadata_path)
superclass_dict = dict(list(enumerate(metadata[b'coarse_label_names'])))
How to load the training and test sets (using superclasses): ``` data_pre_path = './cifar-100-python/' # change this path
data_train_path = data_pre_path + 'train' data_test_path = data_pre_path + 'test'
data_train_dict = unpickle(data_train_path) data_test_dict = unpickle(data_test_path)
data_train = data_train_dict[b'data'] label_train = np.array(data_train_dict[b'coarse_labels']) data_test = data_test_dict[b'data'] label_test = np.array(data_test_dict[b'coarse_labels']) ```
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2352583%2F868a18fb09d7a1d3da946d74a9857130%2FLogo.PNG?generation=1604973725053566&alt=media" alt="">
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data
directory:
unzip -nq crawl-300d-2M-subword.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub
:
```python
import torch
lstm = torch.hub.load("BruceWen120/medal", "lstm") lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa") ```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub
:
python
import torch
electra = torch.hub.load("BruceWen120/medal", "electra")
transformers
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
Download the bibtex
here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The license for the libraries used in this project (transformers
, pytorch
, etc.) can be found in their respective GitHub repository. Our model is released under a MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains simultaneous electroencephalography (EEG) and near-infrared spectroscopy (fNIRS) signals recorded from 12 participants while performing a silent naming task and three sensory-based imagery tasks using visual, auditory, and tactile perception. Participants were asked to visualize an object in their minds, imagine the sounds made by the object, and imagine the feeling of touching the object.
EEG data were acquired with a BioSemi ActiveTwo system with 64 electrodes positioned according to the international 10-20 system, plus one electrode on each earlobe as references ('EXG1' channel is the left ear electrode and 'EXG2' channel is the right ear electrode). Additionally, 2 electrodes placed on the left hand measured galvanic skin response ('GSR1' channel) and a respiration belt around the waist measured respiration ('Resp' channel). The sampling rate was 2048 Hz.
The electrode names were saved in a default BioSemi labeling scheme (A1-A32, B1-B32). See the Biosemi documentation for the corresponding international 10-20 naming scheme (https://www.biosemi.com/pics/cap_64_layout_medium.jpg, https://www.biosemi.com/headcap.htm).
For convenience, the following ordered channels
['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18', 'A19', 'A20', 'A21', 'A22', 'A23', 'A24', 'A25', 'A26', 'A27', 'A28', 'A29', 'A30', 'A31', 'A32', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21', 'B22', 'B23', 'B24', 'B25', 'B26', 'B27', 'B28', 'B29', 'B30', 'B31', 'B32']
can thus be renamed to
['Fp1', 'AF7', 'AF3', 'F1', 'F3', 'F5', 'F7', 'FT7', 'FC5', 'FC3', 'FC1', 'C1', 'C3', 'C5', 'T7', 'TP7', 'CP5', 'CP3', 'CP1', 'P1', 'P3', 'P5', 'P7', 'P9', 'PO7', 'PO3', 'O1', 'Iz', 'Oz', 'POz', 'Pz', 'CPz', 'Fpz', 'Fp2', 'AF8', 'AF4', 'AFz', 'Fz', 'F2', 'F4', 'F6', 'F8', 'FT8', 'FC6', 'FC4', 'FC2', 'FCz', 'Cz', 'C2', 'C4', 'C6', 'T8', 'TP8', 'CP6', 'CP4', 'CP2', 'P2', 'P4', 'P6', 'P8', 'P10', 'PO8', 'PO4', 'O2']
fNIRS data were acquired with a NIRx NIRScoutXP continuous wave imaging system equipped with 4 light detectors, 8 light emitters (sources), and low-profile fNIRS optodes. Both electrodes and optodes were placed in a NIRx NIRScap for integrated fNIRS-EEG layouts. Two different montages were used: frontal and temporal, see references for more information.
Folder 'stimuli' contains all images of the semantic categories of animals and tools presented to participants.
We have prepared example scripts to demonstrate how to load the EEG and fNIRS data into Python using MNE and MNE-BIDS packages. These scripts are located in the 'code' directory.
This dataset was analyzed in the following publications:
[1] Rybář, M., Poli, R. and Daly, I., 2024. Using data from cue presentations results in grossly overestimating semantic BCI performance. Scientific Reports, 14(1), p.28003.
[2] Rybář, M., Poli, R. and Daly, I., 2021. Decoding of semantic categories of imagined concepts of animals and tools in fNIRS. Journal of Neural Engineering, 18(4), p.046035.
[3] Rybář, M., 2023. Towards EEG/fNIRS-based semantic brain-computer interfacing (Doctoral dissertation, University of Essex).
This dataset contains materials from the Coalition for Community-Supported Affordable Geothermal Energy Systems (C2SAGES) project, which evaluated the techno-economic feasibility of a community geothermal system for a residential development in Hinesburg, VT. The dataset includes detailed soil conductivity test reports, energy models, borehole design reports, hourly energy loads for heating, cooling, and hot water, and design layouts. EnergyPlus was used to model building energy loads, and Modelica software was applied for geothermal loop sizing based on these loads and soil conductivity results. Python scripts for network design further refined the models. Key files include PDF reports on borehole design (with projections for 1-year, 15-year, and 30-year systems), soil conductivity test results, EnergyPlus modeling outputs, and 2D/3D design drawings in PDF, DWG, and DXF formats. Python notebooks for network design and OnePipe model files are also provided, with Modelica required for viewing certain files. Outputs and modeling data are in various formats including CSV, JPG, HTML, and IDF, with units and data clearly labeled to support understanding of system design and performance for the proposed geothermal solution.
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.