Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
https://creativecommons.org/publicdomain/zero/1.0/
By Vezora (From Huggingface) [source]
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.
This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.
By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.
The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.
Contents of the Dataset
The dataset consists of several columns:
- output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
- instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
- input: The input parameters or values required to execute each Python code sample.
Exploring the Dataset
To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:
- Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
import pandas as pd
# Load the dataset
df = pd.read_csv('train.csv')
- Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
# Display column names
print(df.columns)
- Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
# Display random samples from 'output' column
print(df['output'].sample(5))
- Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
# Count unique instructions and display top ones with highest occurrences
instruction_counts = df['instruction'].value_counts()
print(instruction_counts.head(10))
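As an alternative to reading a CSV export, the samples can likely be loaded directly from the Hugging Face Hub with the datasets library; this is a minimal sketch assuming the repository ID quoted above and the existence of a 'train' split, both of which should be verified against the hosting page.
from datasets import load_dataset  # pip install datasets

# Repository ID taken from the description above; the split name is an assumption
ds = load_dataset("Vezora/Tested-188k-Python-Alpaca", split="train")
print(ds.column_names)  # expected per the description: instruction, input, output
print(ds[0])            # inspect the first sample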
Potential Use Cases
The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:
- Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
- Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
- Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
- Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.
Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different
- Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
- Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
- Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on the Supercomputer Fugaku over three years of public system usage (March 2021-April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption, and performance metrics (e.g., #flops, memory bandwidth, operational intensity, and a memory-/compute-bound label), which allows for the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv.
The sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes, without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.
F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in the month MM of the year YY.
The files of F-DATA are saved as .parquet files. It is possible to load such files as dataframes by leveraging the pandas APIs, after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:
import pandas as pd
df = pd.read_parquet("21_01.parquet")
df.head()
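To work with more than one month at a time, the monthly files can be concatenated into a single DataFrame; a minimal sketch, assuming the files follow the YY_MM.parquet naming described above and sit in the current directory:
import glob
import pandas as pd

# Collect all monthly parquet files (e.g. 21_03.parquet ... 24_04.parquet)
files = sorted(glob.glob("*.parquet"))

# Read and concatenate; requires pyarrow (pip install pyarrow)
df_all = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
print(df_all.shape)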
https://creativecommons.org/publicdomain/zero/1.0/
Data was imported from the BAK file found here into SQL Server, and individual tables were then exported as CSV. The Jupyter Notebook containing the code used to clean the data can be found here.
Version 6 includes some additional cleaning and structuring to address issues noticed after importing into Power BI. Changes were made by adding code in the Python notebook to export a new cleaned dataset, such as adding a MonthNumber column for sorting by month number, and similarly a WeekDayNumber column.
Cleaning was done in Python, with SQL Server also used to quickly inspect the data. Headers were added separately, ensuring no data loss. The data was cleaned for NaN values, garbage values, and issues in other columns.
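The cleaning code itself is not reproduced in this description; the sketch below only illustrates the kinds of steps mentioned (NaN handling and the added MonthNumber/WeekDayNumber columns), using hypothetical file and column names.
import pandas as pd

# Hypothetical table exported from SQL Server
df = pd.read_csv("sales.csv")

# Drop rows that are entirely NaN and normalize obvious garbage values
df = df.dropna(how="all")
df = df.replace({"N/A": pd.NA, "": pd.NA})

# Derive sort-friendly columns, assuming an OrderDate column exists
df["OrderDate"] = pd.to_datetime(df["OrderDate"], errors="coerce")
df["MonthNumber"] = df["OrderDate"].dt.month
df["WeekDayNumber"] = df["OrderDate"].dt.weekday

df.to_csv("sales_cleaned.csv", index=False)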
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
code.zip: Zip folder containing a folder titled "code" which holds:
csv file titled "MonitoredRainGardens.csv" containing the 14 monitored green infrastructure (GI) sites with their design and physiographic features;
csv file titled "storm_constants.csv" which contain the computed decay constants for every storm in every GI during the measurement period;
csv file titled "newGIsites_AllData.csv" which contain the other 130 GI sites in Detroit and their design and physiographic features;
csv file titled "Detroit_Data_MeanDesignFeatures.csv" which contain the design and physiographic features for all of Detroit;
Jupyter notebook titled "GI_GP_SensorPlacement.ipynb" which provides the code for training the GP models and displaying the sensor placement results;
a folder titled "MATLAB" which contains the following:
folder titled "SFO" which contains the SFO toolbox for the sensor placement work
file titled "sensor_placement.mlx" that contains the code for the sensor placement work
several .mat files created in Python for importing into Matlab for the sensor placement work: "constants_sigma.mat", "constants_coords.mat", "GInew_sigma.mat", "GInew_coords.mat", and "R1_sensor.mat" through "R6_sensor.mat"
several .mat files created in Matlab for importing into Python for visualizing the results: "MI_DETselectedGI.mat" and "DETselectedGI.mat"
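The .mat files above are exchanged between Python and Matlab; a minimal sketch of how such files can be read and written from Python with scipy is shown here, with the variable names inside the files being assumptions for illustration.
import numpy as np
from scipy import io

# Read a .mat file created in Matlab (e.g. the selected-site results);
# the keys inside the file are assumptions for illustration
results = io.loadmat("DETselectedGI.mat")
print(results.keys())

# Write an array back out for use in Matlab (dummy data shown)
io.savemat("constants_sigma.mat", {"sigma": np.eye(3)})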
By Homeland Infrastructure Foundation [source]
The Cellular_Service_Areas.csv dataset provides detailed information about the boundaries and characteristics of cellular service areas. It categorizes these areas based on their respective callsigns, which are unique identifiers associated with each service area.
The dataset includes several columns that provide valuable information for analyzing cellular service areas. The Shape_Area column represents the area of each service area's shape or boundary, stored as numeric (float) values. This allows researchers to understand the extent and size of each coverage zone.
The Shape_Leng column provides the length of the shape or boundary of each cellular service area. Like Shape_Area, it is represented as numeric (float) values and offers insight into the physical dimensions of these zones.
Another significant piece of data provided is the CALL column, which contains text or string values representing the callsign associated with each cellular service area. Callsigns serve as unique identifiers for allocating frequencies among different operators and can be indicative of specific mobile network providers.
Additionally, this dataset includes information about license holders in the LICENSEE column. This text/string field specifies the company or organization that holds a license for operating within a particular cellular service area. It allows users to identify and analyze ownership patterns within this industry.
By combining all these attributes, researchers can gain a comprehensive understanding of different aspects related to cellular service areas like their shapes, sizes, callsign associations, and licensee organizations. The data can enable various analyses such as coverage comparisons between different providers' zones or evaluating geographical distribution trends among license holders.
Overall, this dataset serves as a valuable resource for studying and mapping out cellular service areas categorized by callsigns to better comprehend their geographic reach and industry dynamics.
1. Downloading the Dataset
You can download the dataset from here. Simply click on the link and select the file format that suits your requirements (e.g., CSV, XLSX).
2. Explore the Dataset
The dataset contains several columns providing valuable information about cellular service areas. Here is a brief description of each column:
- CALL: The callsign associated with the cellular service area.
- LICENSEE: The company or organization that holds the license for the cellular service area.
- Shape_Leng: The length of the shape or boundary of the cellular service area.
- Shape_Area: The area of the shape or boundary of the cellular service area.
By analyzing these columns, you can gain insights into different aspects related to cellular coverage areas across various companies and regions.
3. Importing and Loading Data
Once you have downloaded and saved your preferred file format from this dataset, you can import it into your data analysis tool (such as Python's Pandas library) using standard file reading functions specific to your tool.
For example, in Python using Pandas:
import pandas as pd
# Load CSV file into a DataFrame
df = pd.read_csv('Cellular_Service_Areas.csv')
# Explore data using various DataFrame operations
Make sure to adjust your code accordingly based on your chosen programming language and tools.
4. Analyzing and Visualizing Data
With the dataset loaded, you can start exploring and analyzing it to uncover meaningful insights. Here are a few potential analysis and visualization ideas:
- Coverage Analysis: Group the cellular service areas by LICENSEE to understand how different companies or organizations distribute their coverage across different regions.
- Size Analysis: Calculate statistics for the Shape_Area column to identify the largest and smallest cellular service areas.
- Length Analysis: Analyze the information in the Shape_Leng column to determine variations in boundary lengths among different callsigns.
- Geospatial Visualization: Utilize geospatial libraries such as GeoPandas or GIS
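As a concrete illustration of the coverage and size analyses suggested above, here is a short pandas sketch; the column names follow the description, while the aggregation choices are just one reasonable option.
import pandas as pd

df = pd.read_csv('Cellular_Service_Areas.csv')

# Coverage analysis: number of service areas and total area per licensee
coverage = df.groupby('LICENSEE')['Shape_Area'].agg(['count', 'sum']).sort_values('sum', ascending=False)
print(coverage.head(10))

# Size analysis: largest and smallest service areas
print(df.nlargest(5, 'Shape_Area')[['CALL', 'LICENSEE', 'Shape_Area']])
print(df.nsmallest(5, 'Shape_Area')[['CALL', 'LICENSEE', 'Shape_Area']])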
- Network Coverage Analysis: By analyzing the boundaries of cellular service areas, this dataset can be used to assess the coverage a...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, the list being reported in the file "Linux_options.json", are Linux kernel options. The sampling was done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, the dataset only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take more than one value across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" values, and 1 for "y".
In Python, importing the dataset using pandas will assign all columns the int64 dtype, which leads to very high memory consumption (~50 GB). The following snippet imports the dataset using less than 1 GB of memory by setting the option columns to int8.
import pandas as pd
import json
import numpy

# Load the list of option column names
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Read the CSV with all option columns as int8 to keep memory usage low
df = pd.read_csv("Linux.csv", dtype={option: numpy.int8 for option in linux_options})
https://spdx.org/licenses/CC0-1.0
The HURRECON model estimates wind speed, wind direction, enhanced Fujita scale wind damage, and duration of EF0 to EF5 winds as a function of hurricane location and maximum sustained wind speed. Results may be generated for a single site or an entire region. Hurricane track and intensity data may be imported directly from the US National Hurricane Center's HURDAT2 database. HURRECON is available in R and Python. The R version is available on CRAN as HurreconR. The model is an updated version of the original HURRECON model written in Borland Pascal for use with Idrisi (see HF025). New features include support for: (1) estimating wind damage on the enhanced Fujita scale, (2) importing hurricane track and intensity data directly from HURDAT2, (3) creating a land-water file with user-selected geographic coordinates and spatial resolution, and (4) creating plots of site and regional results. The model equations for estimating wind speed and direction, including parameter values for inflow angle, friction factor, and wind gust factor (over land and water), are unchanged from the original HURRECON model. For more details and sample datasets, see the project website on GitHub (https://github.com/hurrecon-model).
The First Public Data Release (DR1) of the Transient Host Exchange (THEx) Dataset
Paper describing the dataset: “Linking Extragalactic Transients and their Host Galaxy Properties: Transient Sample, Multi-Wavelength Host Identification, and Database Construction” (Qin et al. 2021)
The data release contains four compressed archives.
“BSON export” is a binary export of the “host_summary” collection, which is the “full version” of the dataset. The schema was presented in the Appendix section of the paper. You need to set up a MongoDB server to use this version of the dataset. After setting up the server, you may import this BSON file into your local database as a collection using the “mongorestore” command. You may find useful tutorials for setting up the server and importing BSON files into your local database at: https://docs.mongodb.com/manual/installation/ and https://www.mongodb.com/basics/bson. You may run common operations like query and aggregation once you import this BSON snapshot into your local database. An official tutorial can be found at: https://docs.mongodb.com/manual/tutorial/query-documents/. There are other packages (e.g., pymongo for Python) and software to perform these database operations.
“JSON export” is a compressed archive of JSON files. Each file, named by the unique id and the preferred name of the event, contains the complete host data of a single event. The data schema and contents are identical to the “BSON” version.
“NumPy export” contains a series of NumPy tables in “npy” format. There is a row-to-row correspondence across these files. Except for the “master table” (THEx-v8.0-release-assembled.npy), which contains all the columns, each file contains the host properties cross-matched in a single external catalog. The meta info and ancillary data are summarized in THEx-v8.0-release-assembled-index.npy. There is also a THEx-v8.0-release-typerowmask.npy file, which has rows co-indexed with the other files and columns named after each transient type. The “rowmask” file allows you to select the subset of events under a specific transient type. Note that in this version, we only include cataloged properties of the confirmed hosts or primary candidates. If the confirmed host (or primary candidate) cross-matched multiple sources in a specific catalog, we only use the representative source for host properties. Properties of other cross-matched groups are not included. Finally, the table THEx-v8.0-release-MWExt.npy contains the calculated foreground extinction (in magnitudes) at host positions. These extinction values have not been applied to the magnitude columns in our dataset. You need to perform this correction yourself if desired.
“FITS export” includes the same individual tables as in “NumPy export”. However, the FITS standard limits the number of columns in a table. Therefore, we do not include the “master table” in “FITS export”.
Finally, in the BSON and JSON versions, cross-matched groups (under the “groups” key) are ordered by the default ranking function. Even if the first group in this list (namely, the confirmed host or primary host candidate) is a mismatched or misidentified one, we keep it in its original position. The result of visual inspection, including our manual reassignments, has been summarized under the “vis_insp” key. For the NumPy and FITS versions, if we have manually reassigned the host of an event, the data presented in these tables are also updated accordingly.
You may use the “case_code” column in the “index” file to find the result of visual inspection and manual reassignment, where the flags for this “case_code” column are summarized in case-code.txt. Generally, codes “A1” and “F1” are known and new hosts that passed our visual inspection, while codes “B1” and “G1” are mismatched known hosts and possibly misidentified new hosts that have been manually reassigned.
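For the NumPy export, a minimal loading sketch is shown below; whether the arrays are structured record arrays, and which field names they expose, are assumptions that should be checked against the release.
import numpy as np

# The master table and the row-aligned index / type-mask tables
master = np.load("THEx-v8.0-release-assembled.npy", allow_pickle=True)
index = np.load("THEx-v8.0-release-assembled-index.npy", allow_pickle=True)
rowmask = np.load("THEx-v8.0-release-typerowmask.npy", allow_pickle=True)

print(master.shape, index.shape, rowmask.shape)

# Example: select rows belonging to one transient type, assuming the mask
# table is a structured array with one boolean column per type
# subset = master[rowmask["SN Ia"].astype(bool)]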
Source code, documentation, and examples of use of the source code for the Dioptra Test Platform.
Dioptra is a software test platform for assessing the trustworthy characteristics of artificial intelligence (AI). Trustworthy AI is: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair, with harmful bias managed. Dioptra supports the Measure function of the NIST AI Risk Management Framework by providing functionality to assess, analyze, and track identified AI risks.
Dioptra provides a REST API, which can be controlled via an intuitive web interface, a Python client, or any REST client library of the user's choice for designing, managing, executing, and tracking experiments. Details are available in the project documentation at https://pages.nist.gov/dioptra/.
Use Cases
We envision the following primary use cases for Dioptra:
- Model Testing:
-- 1st party: Assess AI models throughout the development lifecycle
-- 2nd party: Assess AI models during acquisition or in an evaluation lab environment
-- 3rd party: Assess AI models during auditing or compliance activities
- Research: Aid trustworthy AI researchers in tracking experiments
- Evaluations and Challenges: Provide a common platform and resources for participants
- Red-Teaming: Expose models and resources to a red team in a controlled environment
Key Properties
Dioptra strives for the following key properties:
- Reproducible: Dioptra automatically creates snapshots of resources so experiments can be reproduced and validated
- Traceable: The full history of experiments and their inputs are tracked
- Extensible: Support for expanding functionality and importing existing Python packages via a plugin system
- Interoperable: A type system promotes interoperability between plugins
- Modular: New experiments can be composed from modular components in a simple yaml file
- Secure: Dioptra provides user authentication, with access controls coming soon
- Interactive: Users can interact with Dioptra via an intuitive web interface
- Shareable and Reusable: Dioptra can be deployed in a multi-tenant environment so users can share and reuse components
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."
This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."
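A minimal sketch for loading one of the six GDP tables with pandas, following the pattern used in the Software Carpentry lesson; the exact file name is an assumption and should be replaced with one of the files provided here.
import pandas as pd

# File name follows the gapminder_gdp_<region>.csv pattern used in the lesson
gdp = pd.read_csv("gapminder_gdp_oceania.csv", index_col="country")
print(gdp.head())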
The Institute for the Design of Advanced Energy Systems (IDAES) Integrated Platform is a versatile computational environment offering extensive process systems engineering (PSE) capabilities for optimizing the design and operation of complex, interacting technologies and systems. IDAES enables users to efficiently search vast, complex design spaces to discover the lowest cost solutions while supporting the full process modeling lifecycle, from conceptual design to dynamic optimization and control. The extensible, open platform empowers users to create models of novel processes and rapidly develop custom analyses, workflows, and end-user applications.
IDAES-PSE 2.6.0 Release Highlights
Upcoming Changes
IDAES will be switching to the new Pyomo solver interface in the next release. Whilst this will hopefully be a smooth transition for most users, there are a few important changes to be aware of. The new solver interface uses a different version of the IPOPT writer ("ipopt_v2"), and thus any custom configuration options you might have set for IPOPT will not carry over and will need to be reset. By default, the new Pyomo linear presolver will be activated with ipopt_v2. Whilst we are working to identify any bugs in the presolver, it is possible that some edge cases will remain. IDAES will begin deploying a new set of scaling tools and APIs over the next few releases that make use of the new solver writers. The old scaling tools and APIs will remain for backward compatibility but will begin to be deprecated.
New Models, Tools and Features
- New Intersphinx extension automatically linking Jupyter notebook examples to project documentation
- New end-to-end diagnostics example demonstrated on a real problem
- New complementarity formulation for VLE with cubic equations of state, with backward compatibility for the old formulation
- New solver interface with presolve (ipopt_v2) in support of upcoming changes to the initialization methods and APIs, with the default set to ipopt to maintain backwards compatibility; this will be deprecated once all examples have been updated
- New forecaster and parameterized bidder methods within the grid integration library
- Updated surrogates API and examples to support Keras 3, with backwards compatibility for older formats such as TensorFlow SavedModel (TFSM)
- Updated costing base dictionary to include the 2023 cost year index value
- Updated ProcessBlock to include information on the constructing block class
- Updated Flowsheet Visualizer to allow the visualize() method to return values and functions
Bug Fixes
- Fixed bug in the Modular Property Framework that would cause errors when trying to use phase-based material balances with phase equilibria
- Fixed bug in the Modular Properties Framework that caused errors when initializing models with non-vapor-liquid phase equilibria
- Fixed typos flagged by the June update to crate-ci/typos and removed DMF-related exceptions
- Minor corrections of units-of-measurement handling in power plant waste/transport costing expressions, control volume material holdup expressions, and BTX property package parameters
- Fixed the throwing of >7500 numpy deprecation warnings by replacing scalar value assignment with element extraction and item iteration calls
Testing and Robustness
- Migrated slow tests (>10s) to integration, impacting test coverage but also yielding a nearly 30% decrease in local test runtime
- Pinned pint to avoid issues with older supported Python versions
- Pinned codecov versions to avoid tokenless upload behavior with the latest version
- Bumped extensions to version 3.4.2 to allow pointing to a non-standard install location
Deprecations and Removals
- Python 3.8 is no longer supported. The supported Python versions are 3.9 through 3.12
- The Data Management Framework (DMF) is no longer supported. Importing idaes.core.dmf will cause a deprecation warning to be displayed until the next release
- The SOFC Keras surrogates have been removed. The current version of the SOFC surrogate model in the examples repository is a PySMO Kriging model.
QM9 consists of computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of C, H, O, N, and F. As usual, we remove the uncharacterized molecules and provide the remaining 130,831.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('qm9', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
https://creativecommons.org/publicdomain/zero/1.0/
By nlpai-lab (From Huggingface) [source]
This dataset provides a collection of translations from English to Korean for NLP models such as GPT4ALL, Dolly, and Vicuna Data. The translations were generated using the DeepL API. It contains three columns: instruction represents the instruction given to the model for the translation task, input is the input text that needs to be translated from English to Korean, and output is the corresponding translated text in Korean. The dataset aims to facilitate research and development in natural language processing tasks by providing a reliable source of translated data.
This dataset contains Korean translations of instructions, inputs, and outputs for various NLP models including GPT4ALL, Dolly, and Vicuna Data. The translations were generated using the DeepL API.
Description of Columns
The dataset consists of the following columns:
- instruction: This column contains the original instruction given to the model for the translation task.
- input: This column contains the input text in English that needs to be translated to Korean.
- output: This column contains the translated text in Korean.
How to Utilize this Dataset
You can use this dataset for various natural language processing (NLP) tasks such as machine translation or training language models specifically focused on English-Korean translation.
Here are a few steps on how you can utilize this dataset effectively:
Importing Data: Load or import the provided train.csv file into your Python environment or preferred programming language (see the sketch after this list).
Data Preprocessing: Clean and preprocess both input and output texts if needed. You may consider tokenization, removing stopwords, or any other preprocessing techniques that align with your specific task requirements.
Model Training: Utilize deep learning frameworks like PyTorch or TensorFlow to develop your NLP model focused on English-Korean translation using this prepared dataset as training data.
Evaluation & Fine-tuning: Evaluate your trained model's performance using suitable metrics such as BLEU score or perplexity measurement techniques specific to machine translation tasks. Fine-tune your model by iterating over different architectures and hyperparameters based on evaluation results until desired performance is achieved.
Inference & Deployment: Once you are satisfied with your trained model's performance, use it for making predictions on unseen English texts which need translation into Korean within any application where it can provide meaningful value.
Remember that this dataset was translated using DeepL API; thus, you can leverage these translations as a starting point for your NLP projects. However, it is essential to validate and further refine the translations according to your specific use case or domain requirements.
Good luck with your NLP projects using this Korean Translation Dataset!
- Training and evaluating machine translation models: This dataset can be used to train and evaluate machine translation models for translating English text to Korean. The instruction column provides specific instructions given to the model, while the input column contains the English text that needs to be translated. The output column contains the corresponding translations in Korean.
- Language learning and practice: This dataset can be used by language learners who want to practice translating English text into Korean. Users can compare their own translations with the provided translations in the output column to improve their language skills.
- Benchmarking different translation APIs or models: The dataset includes translations generated using the DeepL API, but it can also be used as a benchmark for comparing other translation APIs or models. By comparing the performance of different systems on this dataset, researchers and developers can gain insights into the strengths and weaknesses of different translation approaches
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https:/...
The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information.
The default config is made of patches extracted from the original mammograms, following the description from (http://arxiv.org/abs/1708.09427), in order to frame the task to solve in a traditional image classification setting.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('curated_breast_imaging_ddsm', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/curated_breast_imaging_ddsm-patches-3.0.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the R-MRIO database for the years 1995–2005 of the study "A highly resolved MRIO database for analyzing environmental footprints and Green Economy Progress".
https://doi.org/10.1016/j.scitotenv.2020.142587
The code to resolve the database and the data for the years 2006–2015 are stored under the repository http://doi.org/10.5281/zenodo.3993659
The folders "R-MRIO_year" provide the following files (*.mat-files) for each year from 1995–2005:
A_RMRIO: the coefficient matrix
Y_RMRIO: the final demand matrix
Ext_RMRIO and Ext_hh_RMRIO: the satellite matrix of the economy and the final demand
TotalOut_RMRIO: the total output vector
The labels of the matrices are provided in the separate folder "Labels_RMRIO".
A script for importing and indexing the RMRIO database files in Python as Pandas DataFrames can be found here.
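Independent of that script, the .mat files can also be read directly from Python; a minimal sketch using scipy is shown below (for MAT v7.3/HDF5 files, h5py would be needed instead), with the folder layout and in-file variable names being assumptions.
from scipy import io

# Load one year's coefficient matrix and final demand matrix; the folder
# and file names are assumptions based on the listing above
year = "1995"
A = io.loadmat(f"R-MRIO_{year}/A_RMRIO.mat")
Y = io.loadmat(f"R-MRIO_{year}/Y_RMRIO.mat")
print(A.keys(), Y.keys())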