License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This repository includes MATLAB files and datasets related to the IEEE IIRW 2023 conference paper: T. Zanotti et al., "Reliability Analysis of Random Telegraph Noise-based True Random Number Generators," 2023 IEEE International Integrated Reliability Workshop (IIRW), South Lake Tahoe, CA, USA, 2023, pp. 1-6, doi: 10.1109/IIRW59383.2023.10477697
The repository includes:
The data for the bitmaps reported in Fig. 4, i.e., the results of the simulation of the ideal RTN-based TRNG circuit for different reseeding strategies. To load and plot the data, use the "plot_bitmaps.mat" file.
The results of the circuit simulations considering the evolving RTN from the HfO2 device shown in Fig. 7, for two Rgain values. Specifically, the data are contained in the following CSV files:
"Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_11n.csv" (lower Rgain)
"Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_4_8n.csv" (higher Rgain)
The results of the circuit simulations considering the temporary RTN from the SiO2 device shown in Fig. 8. Specifically, the data are contained in the following CSV files:
"Sim_TRNG_Circuit_SiO2_1c_300s_Vth_180m_Noise_Ibias_1.5n.csv" (ref. Rgain)
"Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.575n.csv" (lower Rgain)
"Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.425n.csv" (higher Rgain)
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion:
The DataFrame is exported to an Excel file using the to_excel() function and to a CSV file using the to_csv() function.
Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines. In this project, we assemble several records into DataFrames, save them to a single Excel file as separately named sheets, and then convert that Excel file to CSV files.
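A minimal sketch of this workflow (the column and sheet names are illustrative, not taken from the project; writing .xlsx files requires the openpyxl package):

```python
import pandas as pd

# Create two small DataFrames from in-memory records (illustrative data).
people = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 54]})
scores = pd.DataFrame({"name": ["Ada", "Linus"], "score": [95, 88]})

# Save both DataFrames into one Excel workbook, each on its own sheet.
with pd.ExcelWriter("project_data.xlsx") as writer:
    people.to_excel(writer, sheet_name="people", index=False)
    scores.to_excel(writer, sheet_name="scores", index=False)

# Convert every sheet of the Excel file into a separate CSV file.
workbook = pd.read_excel("project_data.xlsx", sheet_name=None)  # dict of DataFrames
for sheet_name, frame in workbook.items():
    frame.to_csv(f"{sheet_name}.csv", index=False)
```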
License: GNU General Public License v2.0, https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
-----------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used; 3.2 is known to be incompatible)

Depending on the detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to some newly created folder path
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and GitLab support were not yet implemented when this study was in progress): edit `scraper/init.py` and comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Arto (From Huggingface) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory. Additionally, each entry provides a list (or multiple rows) of strings representing written descriptions or captions describing the respective image.
Given this structure, the dataset can be immensely valuable to researchers, developers, and enthusiasts working on innovative computer vision algorithms such as automatic text generation based on visual content analysis, whether that means training machine learning models to generate relevant captions for new, unseen images or evaluating existing systems' performance against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset, which contains not only captions but also corresponding images across different sets specifically designed to cater to varied purposes within computer vision tasks.
Overview of the Dataset
The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.
Understanding the Files
- train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
- test.csv: The test set is included in this file, which has the same structure as train.csv. The purpose of this file is to evaluate your trained models on unseen data.
- valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.
Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with the structure of each CSV file: train.csv, test.csv, and valid.csv. Understand how information like the filename(s) (filename) corresponds with its respective caption(s) (captions).
- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed.
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
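For example, a first loading step with pandas might look like this (a hedged sketch assuming the filename and captions columns described above):

```python
import pandas as pd

# Load the training split; each row pairs an image filename with its captions.
train = pd.read_csv("train.csv")

# Group all captions belonging to the same image (the dataset provides
# multiple captions per image, possibly as multiple rows).
captions_per_image = train.groupby("filename")["captions"].apply(list)

print(captions_per_image.head())
```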
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains all the citation data (in CSV format) included in COCI, released on 23 January 2023. In particular, each line of the CSV file defines a citation, and includes the following information:
- [field "oci"]: the Open Citation Identifier (OCI) for the citation;
- [field "citing"]: the DOI of the citing entity;
- [field "cited"]: the DOI of the cited entity;
- [field "creation"]: the creation date of the citation (i.e. the publication date of the citing entity);
- [field "timespan"]: the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
- [field "journal_sc"]: whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
- [field "author_sc"]: whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).
This version of the dataset contains:
1,463,920,523 citations; 77,045,952 bibliographic resources.
The size of the zipped archive is 37.5 GB, while the size of the unzipped CSV file is 238.5 GB.
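Given that the unzipped CSV is 238.5 GB, reading it in chunks is advisable. A hedged sketch with pandas (the file name is illustrative, and the yes/no encoding of journal_sc should be verified against the data):

```python
import pandas as pd

# Stream the COCI dump in manageable chunks rather than loading 238.5 GB at once.
# The file name below is illustrative; use the actual CSV from the archive.
journal_self_citations = 0
for chunk in pd.read_csv("coci_2023-01-23.csv",
                         usecols=["oci", "citing", "cited", "journal_sc"],
                         chunksize=1_000_000):
    # journal_sc is assumed to be encoded as "yes"/"no"; verify against the data.
    journal_self_citations += (chunk["journal_sc"] == "yes").sum()

print(f"journal self-citations counted: {journal_self_citations}")
```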
Additional information about COCI can be found at the official webpage.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Overview: The corona jurisprudence of the Federal Constitutional Court (BVerfG-Corona) is a fully automated compilation of all decisions of the Federal Constitutional Court associated with the novel coronavirus (SARS-CoV-2). This data set is based on the corpus of decisions of the Federal Constitutional Court (CE-BVerfG) and documents all decisions whose full text contains the terms "Corona", "SARS-CoV", or "COVID". Please note the codebook of the CE-BVerfG! It contains important information on the correct use of the data set and helps in deciding which variant is best for you. I usually recommend the PDF collection for traditional research and the CSV files for quantitative research. The BVerfG-Corona does not contain a CSV variant of the decisions; please use the CE-BVerfG for this.
Update: This data set is updated approximately every 3 months during the corona pandemic. I always publish notifications about new and updated data sets promptly on Twitter at @FobbeSean.

Key data: Deadline: January 8, 2021. Scope of content: 56 decisions (version 2021-01-08). Formats: PDF and TXT.

Source code and compilation report: The entire creation process is fully automated and documented in detail from version 2021-01-08. With each compilation of the complete data set, a comprehensive compilation report is created in an attractively designed PDF format (similar to the codebook). The compilation report contains the complete source code, documents relevant calculation results, provides time stamps accurate to the second, and has a clickable table of contents. It is stored together with the source code. If you are interested in details of the creation process, please read this first. The complete source code is publicly available and permanently accessible in the scientific archive of CERN under this link: https://doi.org/10.5281/zenodo.4459416

Cryptographic signatures: The integrity and authenticity of the individual archives of the data record are ensured by a two-phase signature. In phase I, hash values are calculated for each ZIP archive during compilation using two different methods (SHA2-256 and SHA3-512) and documented in a CSV file. In phase II, this CSV file is signed with my personal secret GPG key. This procedure ensures that the compilation can be carried out by anyone, especially in the context of replications, but that there is still a personal guarantee of results. The CSV file with the hash checksums created during the compilation of the data set carries my personal GPG signature. The public key corresponding to this version is stored with both the data set and the source code. It has the following characteristics: Name: Sean Fobbe (fobbe-data@posteo.de). Fingerprint: FE6F B888 F0E5 656C 1D25 3B9A 50C4 1384 F44A 4E42

No copyright: public domain. In accordance with Section 5 (1) UrhG, there is no copyright in the decision texts and official guidelines, as they are official works. Section 5 UrhG is to be applied analogously to official databases (BGH, decision of September 28, 2006 - I ZR 261/03, "Saxon tendering service"). All my own contributions (e.g. compiling and adapting the metadata), and thus the entire data set, are completely copyright-free in accordance with a CC0 1.0 Universal Public Domain License.

Disclaimer: This data set is a private scientific initiative and has no connection with the Federal Constitutional Court or with the editors of the BVerfGE.

Further Open Access Publications (Fobbe): You can find more open data here: https://zenodo.org/communities/sean-fobbe-data/ My code repository is here: https://zenodo.org/communities/sean-fobbe-code/ Full texts of regular publications are available here: https://zenodo.org/communities/sean-fobbe-publications/

Contact: Found a mistake? Suggestions? Either report it in the issue tracker on GitHub or send me an email at fobbe-data@posteo.de
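As an illustration of the phase-I checks described above, here is a hedged sketch; the checksum CSV's name and column layout are assumptions, so verify them against the compilation report:

```python
import csv
import hashlib

def file_digests(path):
    """Recompute both digests (SHA2-256 and SHA3-512) for one archive."""
    sha2 = hashlib.sha256()
    sha3 = hashlib.sha3_512()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            sha2.update(block)
            sha3.update(block)
    return sha2.hexdigest(), sha3.hexdigest()

# Illustrative checksum file name and column names; the real layout
# is documented in the compilation report.
with open("checksums.csv", newline="") as f:
    for row in csv.DictReader(f):  # assumed columns: filename, sha2-256, sha3-512
        sha2_hex, sha3_hex = file_digests(row["filename"])
        ok = sha2_hex == row["sha2-256"] and sha3_hex == row["sha3-512"]
        print(row["filename"], "OK" if ok else "MISMATCH")
```

Verifying the GPG signature on the checksum CSV (phase II) is then done with the published public key using standard GPG tooling.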
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
A renewable energy resource-based sustainable microgrid model for a residential area is designed with the HOMER PRO microgrid software. A small residential area of 20 buildings housing about 60 families, with 219 MWh annual energy consumption, plus an electric vehicle charging station charging 10 batteries daily, with 18.3 MWh annual energy consumption, is considered; the Padma residential area, Rajshahi (24°22.6'N, 88°37.2'E) is selected as our case study. Solar panels, a natural gas generator, an inverter, and Li-ion batteries are required for our proposed model. The HOMER PRO microgrid software is used to optimize our designed microgrid model. Data were collected from HOMER PRO for the year 2007. We compared our daily load demand of 650 kW with results for loads varied by 10%, 5%, and 2.5% above and below to find the best case according to our demand. We have a total of 7 different datasets for different load conditions, where each dataset contains 8,760 records with 6 different parameters per record.

Data file contents:

Data 1:: original_load.csv: This file contains data for the 650 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Data arrangement is given below:
Column 1: Date and time of data recording in the format MM-DD-YYYY [hh]:[mm]. Time is in 24-hour format.
Column 2: Solar power output in kW.
Column 3: Generator power output in kW.
Column 4: Total electrical load served in kW.
Column 5: Excess electrical production in kW.
Column 6: Li-ion battery energy content in kWh.
Column 7: Li-ion battery state of charge in %.
Data 2:: 2.5%_more_load.csv: This file contains data for the 677 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
Data 3:: 2.5%_less_load.csv: This file contains data for the 622 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
Data 4:: 5%_more_load.csv: This file contains data for the 705 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
Data 5:: 5%_less_load.csv: This file contains data for the 595 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
Data 6:: 10%_more_load.csv: This file contains data for the 760 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
Data 7:: 10%_less_load.csv: This file contains data for the 540 kW load demand. The dataset contains 8,760 records with 6 different parameters per record. Column information is the same for every dataset.
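A hedged loading sketch follows; the column labels are paraphrased from the arrangement above, and the real CSV headers may differ:

```python
import pandas as pd

# Illustrative column labels paraphrasing the arrangement described above;
# the actual headers in the CSV files may differ.
columns = ["timestamp", "solar_kw", "generator_kw", "load_served_kw",
           "excess_kw", "battery_kwh", "battery_soc_pct"]

original = pd.read_csv("original_load.csv", names=columns, header=0,
                       parse_dates=["timestamp"])

# One year of hourly records: 8,760 rows.
assert len(original) == 8760
print(original[["solar_kw", "generator_kw"]].describe())
```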
License: The Unlicense, https://choosealicense.com/licenses/unlicense/
I used my notebook to generate CSV files for Windows event codes: https://github.com/whit3rabbit/Windows-Event-Codes-CSV. The CSV is here: https://github.com/whit3rabbit/Windows-Event-Codes-CSV/blob/main/updated_detailed_events.csv. I converted each line to markdown and used it to generate questions and answers. These have not been vetted for accuracy, so use with caution.
License: U.S. Government Works, https://www.usa.gov/government-works
The NRCS National Water and Climate Center Report Generator web-based application uses long-term snowpack, precipitation, reservoir, streamflow, and soils data from a variety of quality-controlled sources to create reports. Users can choose from predefined templates or build custom reports. Data from tabular reports may be exported to different formats, including comma-separated value (CSV) files. Charts can be saved to graphics formats such as JPG and PNG.

The Report Generator network incorporates data from many agency databases. The NRCS snow survey flagship database, the Water and Climate Information System (WCIS), provides a wealth of data, including manually collected snow course data and information from automated Snow Telemetry (SNOTEL) and Soil Climate Analysis Network (SCAN) stations across the United States. Report Generator also uses precipitation, streamflow, and reservoir data from the U.S. Army Corps of Engineers (USACE), the U.S. Bureau of Reclamation (BOR), the Applied Climate Information System (ACIS), the U.S. Geological Survey (USGS), and various water districts and other entities.

In addition to creating reports, Report Generator lets you view information on sites, including metadata such as elevation, latitude/longitude, and hydrologic unit code (HUC). You can also view photos of the site, including a site map (in Google Maps when available). Report Generator creates reports in both tabular and chart format. Single-station and multiple-station charting is supported. Data may be displayed in either English or metric units.

Farmers, municipalities, water and hydroelectric utilities, environmental organizations, fish and wildlife managers, tribal nations, reservoir managers, recreationists, wetlands managers, urban developers, transportation departments, and research organizations regularly use these data and products. This release has several new features which focus on improving the way reports are specified and how they are displayed. Multi-station charting is also supported in this release.

Resources in this dataset:
Resource Title: Report Generator 2.0. File Name: Web Page, url: https://wcc.sc.egov.usda.gov/reportGenerator/
Create custom reports and charts from multiple data sources. Data from tabular reports may be exported to different formats, including comma-separated value (CSV) files. Charts can be saved to graphics formats, such as JPG and PNG.
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production surfaces. PhytoFrontiers, DOI pending.

Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage, and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

- Drying2022.csv: drying rate data for the 2022 experimental run
- Weather2022.csv: weather data for the 2022 experimental run
- Drying2023.csv: drying rate data for the 2023 experimental run
- Weather2023.csv: weather data for the 2023 experimental run
- disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
- disinfestant_drying_analysis.html: rendered output of the notebook
- MS_figures.R: additional R code to create figures formatted for journal requirements
- fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
- fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
- data_dictionary.xlsx: descriptions of each column in the CSV data files
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By wiki_bio (From Huggingface) [source]
The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.
In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.
The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.
Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.
This extended description aims to provide an informative overview of the dataset structure and its intended use cases in natural language processing research tasks such as text generation or summarization. Researchers can leverage this comprehensive collection to advance applications in automatic biography writing systems or content generation tasks that require coherent textual output based on partial information extracted from an infobox or the initial paragraph of online encyclopedias like Wikipedia.
Overview:
- This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
- The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
- Each file contains pairs of input text and target text.
File Descriptions:
- train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
- val.csv: Validation purposes can be fulfilled using this file. It contains a collection of biographies with infobox and first paragraph texts.
- test.csv: This file can be used to generate complete biographies based on the given input texts.
Column Information:
a) For train.csv:
- input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
- target_text: Target text column containing the complete biography text for each entry.

b) For val.csv:
- input_text: Infobox and first paragraph texts are included in this column.
- target_text: Complete biography texts are present in this column.

c) For test.csv: The columns follow the pattern mentioned previously, i.e., input_text followed by target_text.

Usage Guidelines:
Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.
Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.
Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.
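As a quick illustration (a hedged sketch assuming the input_text and target_text columns described above):

```python
import pandas as pd

# Load the training pairs: infobox + first paragraph -> full biography.
train = pd.read_csv("train.csv")

# Preview a few input/target pairs, truncated for readability.
for _, row in train.head(3).iterrows():
    print("INPUT: ", str(row["input_text"])[:120], "...")
    print("TARGET:", str(row["target_text"])[:120], "...")
    print("-" * 40)
```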
Additional Information and Tips:
The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.
The target text is the complete biography for each entry.
While working with this dataset, make sure to preprocess and
- Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
License: Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
- metadata.zip: The dataset metadata and analysis results as CSV files.
- scripts-and-logs.zip: Scripts and logs of the dataset creation.
- LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
- README.md: This document.
- redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata

The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
- ID (integer): GitHub repository ID
- url (string): GitHub repository URL
- downloaded (boolean): Whether cloning the repository succeeded
- name (string): Repository name
- description (string): Repository description
- licenses (string, list of strings): Repository licenses
- redistributable (boolean): Whether the repository's licenses permit redistribution
- created (string, date & time): Time of the repository's creation
- updated (string, date & time): Time of the last update to the repository
- pushed (string, date & time): Time of the last push to the repository
- fork (boolean): Whether the repository is a fork
- forks (integer): Number of forks
- archive (boolean): Whether the repository is archived
- programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
- ID (string): Project file path of the IaC program
- repository (integer): GitHub repository ID of the repository containing the IaC program
- directory (string): Path of the directory containing the IaC program's project file
- solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
- language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
- name (string): IaC program name
- description (string): IaC program description
- runtime (string): Runtime string of the IaC program
- testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
- file (string): Testing file path
- language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
- techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
- program (string): Project file path of the testing file's IaC program
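The three tables link via the repository ID and the program's project file path; the following hedged sketch (assuming the CSV names above, extracted from metadata.zip) joins them with pandas:

```python
import pandas as pd

# Join the three metadata tables: repositories -> programs -> testing files.
repositories = pd.read_csv("repositories.csv")
programs = pd.read_csv("programs.csv")
testing_files = pd.read_csv("testing-files.csv")

# Programs reference their repository by GitHub repository ID.
programs_with_repo = programs.merge(
    repositories, left_on="repository", right_on="ID",
    suffixes=("_program", "_repo"))

# Testing files reference their IaC program by project file path.
tests_with_program = testing_files.merge(
    programs, left_on="program", right_on="ID",
    suffixes=("_file", "_program"))

print(programs_with_repo.groupby("solution").size())  # programs per PL-IaC solution
```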
Dataset Creation

scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories

The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).

Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations

The script uses the GitHub code search API and inherits its limitations:
- Only forks with more stars than the parent repository are included.
- Only the repositories' default branches are considered.
- Only files smaller than 384 KB are searchable.
- Only repositories with fewer than 500,000 files are considered.
- Only repositories that have had activity or have been returned in search results in the last year are considered.

More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories

download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.

The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is part of a framework for testing novelty-aware agents in the Smart Environment domain. See the associated GitHub repository here for more information. The agent is evaluated on a set of trials for each type of novelty. Each trial provides a sequence of episodes, where each episode is one day in the life of the smart environment (i.e., smart home). At some point in the sequence of episodes, a novelty is introduced and persists for the remainder of the trial. There are 8 levels of novelty (again, see the repo referenced above), including trials with no novelty (level 0). Furthermore, each novelty level is categorized as easy, medium, or hard difficulty based on the likely impact of the novelty on the agent's performance. The agent is evaluated on 30 trials for each novelty level/difficulty, and each trial consists of 200 episodes. Data for the various novelty levels are available at the following links due to their size.
Novelty Levels 0-5: Trials and episodes used to evaluate the agent on novelty levels 0-5. File: smartenv_012345.zip.
Novelty Levels 0, 6-8: Trials and episodes used to evaluate the agent on novelty levels 0, 6, 7, and 8. The level 0 trials are different from the above level 0 trials, but are added here so that level 0 trials can be included along with the evaluation of levels 6-8, if desired. File: smartenv_0678.zip.
Training data: Training data for each of the three smart environment testbeds. These three CSV files should be put in the same location as the unzipped folders above. File: smartenv_training.zip.
This data package is associated with the publication "On the Transferability of Residence Time Distributions in Two 10-km Long River Sections with Similar Hydromorphic Units," submitted to the Journal of Hydrology (Bao et al. 2024). Quantifying hydrologic exchange fluxes (HEFs) at the stream-groundwater interface, along with their residence time distributions (RTDs) in the subsurface, is crucial for managing water quality and ecosystem health in dynamic river corridors. However, directly simulating high-spatial-resolution HEFs and RTDs can be a time-consuming process, particularly for watershed-scale modeling. Efficient surrogate models that link RTDs to hydromorphic units (HUs) may serve as alternatives for simulating RTDs in large-scale models. One common concern with these surrogate models, however, is the transferability of the relationship between the RTDs and HUs from one river corridor to another. To address this, we evaluated the HEFs and the resulting RTD-HU relationships for two 10-kilometer-long river corridors along the Columbia River, using a one-way coupled three-dimensional transient surface-subsurface water transport modeling framework that we previously developed. Applying this framework to the two river corridors with similar HUs allows for quantitative comparisons of HEFs and RTDs using both statistical tests and machine learning classification models.

This data package includes the model input files and the simulation results data. It contains 10 folders. The modeling simulation results are in the folders 100H_pt_data and 300area_pt_data, for the study domains Hanford 100H and 300 Area, respectively. The remaining eight folders contain the scripts and data to generate the manuscript figures. The file-level metadata file (Bao_2024_Residence_Time_Distribution_flmd.csv) lists all files contained in this data package with a description of each. The data dictionary file (Bao_2024_Residence_Time_Distribution_dd.csv) includes column header definitions and units for all tabular files.
License/terms: https://dataintelo.com/privacy-and-policy
According to our latest research, the global CSV Editor market size reached USD 1.32 billion in 2024, reflecting the growing integration of data-centric processes across numerous industries. The market is experiencing robust expansion, supported by a CAGR of 10.4% from 2025 to 2033. By the end of the forecast period in 2033, the CSV Editor market is anticipated to achieve a value of USD 3.16 billion. The primary growth factor driving this market is the increasing reliance on structured data formats for business intelligence, analytics, and process automation, which has led to a surge in demand for advanced CSV editing tools globally.
The growth of the CSV Editor market is underpinned by the digital transformation initiatives adopted by enterprises across sectors such as BFSI, healthcare, IT, and retail. As organizations generate and handle exponentially larger volumes of data, the need to efficiently manage, clean, and manipulate CSV files has become crucial. CSV Editors, which enable users to modify, validate, and visualize large datasets, are now considered essential for data-driven decision-making. The proliferation of cloud computing and the rise of big data analytics have further accentuated the importance of robust CSV editing solutions, as businesses seek to streamline workflows and enhance productivity.
Another significant growth driver is the increasing adoption of automation and artificial intelligence in data management processes. Modern CSV Editors are evolving from simple file manipulation tools to sophisticated platforms that support scripting, automation, and integration with other enterprise software. This evolution is particularly evident in industries such as healthcare and finance, where the accuracy and consistency of data are paramount. The availability of both on-premises and cloud-based deployment modes has also broadened the market’s appeal, catering to organizations with varying security and compliance requirements. Furthermore, the growing trend of remote work and distributed teams has fueled demand for web-based CSV Editors that facilitate real-time collaboration and seamless access from multiple devices.
The CSV Editor market is also benefitting from the increasing focus on data governance and regulatory compliance. As governments and regulatory bodies implement stricter data protection laws, organizations are compelled to invest in tools that ensure data integrity and traceability. CSV Editors play a pivotal role in maintaining audit trails, validating data formats, and supporting compliance with standards such as GDPR and HIPAA. This regulatory backdrop, combined with the rise in cyber threats and data breaches, has made secure and feature-rich CSV Editors a necessity for enterprises seeking to mitigate risks and safeguard sensitive information.
Regionally, North America dominates the CSV Editor market, accounting for the largest revenue share in 2024, driven by the presence of leading technology firms and widespread adoption of data management solutions. Europe follows closely, with strong demand from the BFSI and healthcare sectors. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digitization, expanding IT infrastructure, and increased investments in data analytics. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions gradually embrace digital transformation and modern data management practices.
The CSV Editor market is segmented by component into Software and Services, with software solutions representing the lion's share of the market in 2024. The software segment encompasses standalone CSV editing applications, integrated development environments (IDEs), and plug-ins that facilitate the manipulation and validation of CSV files. These solutions are in high demand due to their ability to handle large datasets, support complex data transformations, and provide user-friendly interfaces for both technical and non-technical users. The continuous evolution of software features, such as real-time collaboration, version control, and advanced data visualization, is further propelling the adoption of CSV Editor software across industries.
The services segment, while smaller in comparison, is gaining traction as organizations seek
This data release supports an analysis of changes in dissolved organic carbon (DOC) and nitrate concentrations in the Buck Creek watershed near Inlet, New York, from 2001 to 2021. The Buck Creek watershed is a 310-hectare forested watershed that is recovering from acidic deposition within the Adirondack region. The data release includes pre-processed model inputs and model outputs for the Weighted Regressions on Time, Discharge, and Season (WRTDS) model (Hirsch and others, 2010) to estimate daily flow-normalized concentrations of DOC and nitrate during a 20-year period of analysis. WRTDS uses daily discharge and concentration observations, implemented through the Exploration and Graphics for River Trends (EGRET) R package, to predict solute concentration using decimal time and discharge as explanatory variables (Hirsch and De Cicco, 2015; Hirsch and others, 2010). Discharge and concentration data are available from the U.S. Geological Survey National Water Information System (NWIS) database (U.S. Geological Survey, 2016). The time series data were analyzed for the entire period, water years 2001 (WY2001) to WY2021, where WY2001 is the period from October 1, 2000 to September 30, 2001.

This data release contains 5 comma-separated values (CSV) files, one R script, and one XML metadata file. There are four input files ("Daily.csv", "INFO.csv", "Sample_doc.csv", and "Sample_nitrate.csv") that contain site information, daily mean discharge, and mean daily DOC or nitrate concentrations. The R script ("Buck Creek WRTDS R script.R") uses the four input datasets and functions from the EGRET R package to generate estimates of flow-normalized concentrations. The output file ("WRTDS_results.csv") contains model output at daily time steps for each sub-watershed and for each solute. Files are automatically associated with the R script when opened in RStudio using the provided R project file ("Files.Rproj"). All input, output, and R files are in the "Files.zip" folder.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
were active for at least one year (365 days) before the first build with Travis CI (before_ci),
used Travis CI at least for one year (during_ci),
had commit or merge activity on the default branch in both of these phases, and
used the default branch to trigger builds.
To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:
have Java or Ruby as their project language,
used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
have commit activity for at least two years (730 days),
are engineered software projects (at least 10 watchers), and
were not in the TravisTorrent dataset.
In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
This dataset contains the following files:
tr_projects_sample_filtered_2.csv A CSV file with information about the 113 selected projects.
tr_sample_commits_default_branch_before_ci.csv tr_sample_commits_default_branch_during_ci.csv One CSV file each with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns (an R sketch for loading these files follows the file list below):
project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit message (in characters).
file_count: Number of files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
tr_sample_merges_default_branch_before_ci.csv tr_sample_merges_default_branch_during_ci.csv One CSV file each with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:
project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit message (in characters).
file_count: Number of files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from the log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from the log message).
source_branch: Source branch of the pull request (extracted from the log message).
comparison_project_sample_800.csv A CSV file with information about the 800 projects in the comparison sample.
commits_default_branch_before_mid.csv commits_default_branch_after_mid.csv One CSV file each with information about all commits to the default branch before and after the median date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commit tables described above.
merges_default_branch_before_mid.csv merges_default_branch_after_mid.csv One CSV file each with information about all merges into the default branch before and after the median date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
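The commit and merge tables can be loaded with standard CSV tooling. The following minimal R sketch (assuming plain comma-separated files with a header row, and the file and column names listed above) loads one commit table and summarizes per-project activity:

# Minimal sketch: load one commit table and summarize activity per project.
# Assumes a plain comma-separated file with a header row matching the columns above.
commits <- read.csv("tr_sample_commits_default_branch_before_ci.csv",
                    stringsAsFactors = FALSE)

# Total churn (lines added and deleted) per project
churn <- aggregate(cbind(lines_added, lines_deleted) ~ project,
                   data = commits, FUN = sum)

# Number of commits per project
counts <- as.data.frame(table(commits$project))
names(counts) <- c("project", "n_commits")

# Combine and show the most active projects
summary_df <- merge(counts, churn, by = "project")
head(summary_df[order(-summary_df$n_commits), ])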
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide, paste it into the R console, and execute it (a minimal reconstruction of the script is sketched after this protocol). In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as desired and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
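The authors distribute the actual script in the accompanying PowerPoint slide; the lines below are only a minimal reconstruction from the description above, assuming the three columns from Step 1 are named Replicate, Condition, and Value.

# Minimal reconstruction of the described script (not the authors' exact code).
# install.packages("ggplot2")  # see Note 1; install once if needed
library(ggplot2)

# Expected input layout from Step 1 (illustrative values):
#   Replicate,Condition,Value
#   2016_01,wild type,5.7
#   2016_01,mutant A,2.1

replicates <- read.csv(file.choose())  # opens a dialog box to pick the .csv file

graph <- ggplot(replicates, aes(x = Condition, y = Value))
graph +
  geom_boxplot(outlier.colour = "black", colour = "black") +
  geom_jitter(aes(col = Replicate)) +
  theme_bw()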
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Co-Creation Database groups scientific references on co-creation.
It mainly contains the title, abstract, DOI, and authors.
Two versions are available: a CSV file and an RIS file.
To perform your search: we recommend extending the search to both the title and the abstract, since some records are missing one or the other. The RIS file can be uploaded to any reference manager (e.g., Zotero or Mendeley), which provides a search feature; for example, advanced search instructions for Zotero are available at https://www.zotero.org/support/searching. Additionally, you can run a Boolean search over the CSV file with a short script, for example in Python (an equivalent sketch in R follows below).
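No search script ships with the database; as an illustration, the following R lines sketch such a Boolean search over the CSV version, under the assumption that the file contains columns named Title, Abstract, and DOI (hypothetical names, as is the file name; check the actual header):

# Hypothetical Boolean search over the CSV version of the database.
# The file name and the column names (Title, Abstract, DOI) are assumptions.
db <- read.csv("co_creation_database.csv", stringsAsFactors = FALSE)

hit <- grepl("co-creation", db$Title, ignore.case = TRUE) &
       (grepl("health", db$Title, ignore.case = TRUE) |
        grepl("health", db$Abstract, ignore.case = TRUE))

results <- db[hit, c("Title", "Abstract", "DOI")]
nrow(results)  # number of matching references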
To improve the database in further updates: we make available an online form to report any irrelevant references you may find or to submit any relevant reference missing from the latest version. The form is available at the following link: https://forms.office.com/e/6vu9X0kBcw
It was produced as part of Health CASCADE, a Marie Skłodowska-Curie Innovative Training Network funded by the European Union's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement n° 956501.
The work is made available under the terms of the license CC-BY-NC-4.0 (Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International).