100+ datasets found
  1. AI_World_Generator_csv

    • huggingface.co
    Updated Aug 3, 2025
    Cite
    Jensin (2025). AI_World_Generator_csv [Dataset]. https://huggingface.co/datasets/Jensin/AI_World_Generator_csv
    Dataset updated
    Aug 3, 2025
    Authors
    Jensin
    Area covered
    World
    Description

    The Jensin/AI_World_Generator_csv dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  2. Data from: Reliability Analysis of Random Telegraph Noise-based True Random Number Generators

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 30, 2024
    Cite
    Zanotti, Tommaso; Ranjan, Alok; O'Shea, Sean J.; Raghavan, Nagarajan; Thamankar, Dr. Ramesh; Pey, Kin Leong; PUGLISI, Francesco Maria (2024). Reliability Analysis of Random Telegraph Noise-based True Random Number Generators [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13169457
    Dataset updated
    Sep 30, 2024
    Dataset provided by
    University of Modena and Reggio Emilia
    Singapore University of Technology and Design
    Chalmers University of Technology
    Agency for Science, Technology and Research
    VIT University
    Authors
    Zanotti, Tommaso; Ranjan, Alok; O'Shea, Sean J.; Raghavan, Nagarajan; Thamankar, Dr. Ramesh; Pey, Kin Leong; PUGLISI, Francesco Maria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Repository author: Tommaso Zanotti
    • Email: tommaso.zanotti@unimore.it or francescomaria.puglisi@unimore.it
    • Version: v1.0

    This repository includes MATLAB files and datasets related to the IEEE IIRW 2023 conference proceeding: T. Zanotti et al., "Reliability Analysis of Random Telegraph Noise-based True Random Number Generators," 2023 IEEE International Integrated Reliability Workshop (IIRW), South Lake Tahoe, CA, USA, 2023, pp. 1-6, doi: 10.1109/IIRW59383.2023.10477697

    The repository includes:

    The data of the bitmaps reported in Fig. 4, i.e., the results of the simulation of the ideal RTN-based TRNG circuit for different reseeding strategies. To load and plot the data use the "plot_bitmaps.mat" file.

    The result of the circuit simulations considering the EvolvingRTN from the HfO2 device shown in Fig. 7, for two Rgain values. Specifically, the data is contained in the following csv files:

    "Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_11n.csv" (lower Rgain)

    "Sim_TRNG_Circuit_HfO2_3_20s_Vth_210m_no_Noise_Ibias_4_8n.csv" (higher Rgain)

    The result of the circuit simulations considering the temporary RTN from the SiO2 device shown in Fig. 8. Specifically, the data is contained in the following csv files:

    "Sim_TRNG_Circuit_SiO2_1c_300s_Vth_180m_Noise_Ibias_1.5n.csv" (ref. Rgain)

    "Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.575n.csv" (lower Rgain)

    "Sim_TRNG_Circuit_SiO2_1c_100s_200s_Vth_180m_Noise_Ibias_1.425n.csv" (higher Rgain)

  3. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

    • zenodo.org
    application/gzip, bin, zip, text/x-python
    Updated Aug 2, 2024
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    GNU GPL 2.0: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2 TB of disk space (see Step 2 detail levels)
    - at least 16 GB of RAM (64 GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       cd ghd
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py` and comment out everything except GitHub support
     in `PROVIDERS`.
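As a rough illustration of the two edits described above, a settings.py might end up looking like this (the field names `DATASET_PATH` and `SCRAPER_GITHUB_API_TOKENS` come from the guide; the values are placeholders, and the real template may contain other settings):

```python
# settings.py - sketch showing only the two fields the guide says to edit.
# Values are placeholders, not real credentials.
DATASET_PATH = "/data/ghd_dataset"  # a newly created folder for dataset output
SCRAPER_GITHUB_API_TOKENS = [
    "ghp_xxxxxxxxxxxxxxxxxxxx",  # at least one GitHub API token
]
```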
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15 to 30 minutes.
    
    - create a folder `
  4. RSICD Image Caption Dataset

    • kaggle.com
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). RSICD Image Caption Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/rsicd-image-caption-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    RSICD Image Caption Dataset

    By Arto (from Huggingface) [source]

    About this dataset

    The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.

    Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.

    Each entry in these CSV files includes a filename string that identifies an image file stored in another location or directory, along with a list (or multiple rows) of strings containing written descriptions or captions for that image.

    Given this structure, the dataset can be immensely valuable to researchers, developers, and enthusiasts working on computer vision algorithms such as automatic text generation from visual content, whether for training machine learning models to generate captions for new, unseen images or for evaluating existing systems against diverse criteria.

    Stay updated with cutting-edge research trends by leveraging this comprehensive dataset, which contains not only captions but also corresponding images across sets designed for varied computer vision tasks.

    How to use the dataset

    Overview of the Dataset

    The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.

    Understanding the Files

    • train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
    • test.csv: The test set is included in this file, which contains a similar structure as that of train.csv. The purpose of this file is to evaluate your trained models on unseen data.
    • valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.
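The files described above can be read with plain Python. A short sketch, grouping the multiple captions per image (the `filename` and `captions` column names come from the descriptions above; the sample rows are synthetic stand-ins for train.csv/test.csv/valid.csv):

```python
# Sketch: read an RSICD-style CSV and group captions by image filename.
# The sample data here is synthetic, not taken from the actual dataset.
import csv
import io

sample = io.StringIO(
    "filename,captions\n"
    "image_001.jpg,a caption describing the scene\n"
    "image_001.jpg,another caption for the same image\n"
)

by_image = {}
for row in csv.DictReader(sample):
    # Each image has multiple captions, so collect them in a list.
    by_image.setdefault(row["filename"], []).append(row["captions"])
```

Replacing `sample` with `open("train.csv", newline="")` would apply the same grouping to the real training split.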

    Getting Started

    To begin utilizing this dataset effectively, follow these steps:

    • Extract the zip file containing all relevant data files onto your local machine or cloud environment.
    • Familiarize yourself with each CSV file's structure: train.csv, test.csv, and valid.csv. Understand how information like filename(s) (filename) corresponds with its respective caption(s) (captions).
    • Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
    • Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed.
    • Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
    • Split the data into training, validation, and test sets according to your experimental design requirements.
    • Use appropriate algorithms and techniques to train your image captioning models on the provided data.

    Enhancing Model Performance

    To optimize model performance using this dataset, consider these tips:

    • Explore different architectures and pre-trained models specifically designed for image captioning tasks.
    • Experiment with various natural language

    Research Ideas

    • Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
    • Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
    • Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
  5. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jeremy Sheff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files
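As a sketch of working with the extracted /csv folder, the snippet below counts data rows per file. The directory layout is taken from the list above, but the data-file name used in the demo is hypothetical; the actual filenames are listed in the repository itself:

```python
# Sketch: count data rows in each .csv file under a /csv folder.
# "applications.csv" below is a hypothetical filename for the demo.
import csv
from pathlib import Path

def count_rows(csv_dir):
    """Map each CSV filename to its row count, excluding the header."""
    counts = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            counts[path.name] = sum(1 for _ in csv.reader(f)) - 1
    return counts

# Demo on a synthetic directory standing in for /csv:
d = Path("demo_csv")
d.mkdir(exist_ok=True)
(d / "applications.csv").write_text("app_no,filing_date\n1,1980-01-02\n2,1981-03-04\n")
result = count_rows(d)
```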

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  6. CSV Automation Tools Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). CSV Automation Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/csv-automation-tools-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    CSV Automation Tools Market Outlook



    According to our latest research, the global CSV Automation Tools market size reached USD 1.46 billion in 2024, reflecting robust adoption across diverse industries. The market is projected to grow at a CAGR of 11.8% from 2025 to 2033, reaching a forecasted value of USD 4.17 billion by 2033. This impressive growth trajectory is primarily driven by the increasing need for efficient data management, seamless integration, and automation of repetitive tasks in enterprise environments. The proliferation of digital transformation initiatives and the surge in data volumes are further fueling the demand for advanced CSV automation solutions globally.




    The primary growth driver for the CSV Automation Tools market is the exponential rise in data generation across industries such as BFSI, healthcare, IT, and retail. As organizations increasingly rely on data-driven decision-making, the need for tools that can automate the processing, integration, and analysis of CSV files becomes paramount. CSV files remain a universal format for data exchange due to their simplicity and compatibility, but managing large volumes manually is both time-consuming and error-prone. Automation tools reduce manual intervention, improve data accuracy, and accelerate workflows, making them indispensable in modern enterprises. Furthermore, the growing adoption of cloud computing and SaaS-based solutions has made CSV automation tools more accessible and scalable, enabling organizations of all sizes to harness their benefits without substantial upfront investment.




    Another significant factor propelling market growth is the increasing complexity of data integration and migration projects. As businesses adopt hybrid and multi-cloud infrastructures, the need to move, cleanse, and synchronize data between disparate systems has become more challenging. CSV automation tools offer robust capabilities for data mapping, transformation, and validation, ensuring seamless migration and integration processes. These tools also support compliance with data governance regulations, as they help maintain data quality and traceability throughout the data lifecycle. The integration of artificial intelligence and machine learning into CSV automation solutions is further enhancing their capabilities, enabling intelligent data cleansing, anomaly detection, and predictive analytics, which are critical for maintaining high data standards and supporting advanced business intelligence initiatives.




    Additionally, the rising focus on operational efficiency and cost reduction is encouraging organizations to invest in CSV automation tools. By automating repetitive and labor-intensive tasks such as data extraction, transformation, and loading (ETL), companies can significantly reduce manual errors, save time, and allocate resources to more strategic activities. This not only improves productivity but also ensures data consistency across various business applications. The shift towards remote and hybrid work models has further emphasized the need for automated solutions that can be managed and monitored remotely, driving the adoption of cloud-based CSV automation tools. As businesses continue to prioritize agility and scalability, the demand for flexible and customizable automation solutions is expected to rise, further boosting market growth over the forecast period.



    In the realm of data management, Spreadsheet Automation Tools have emerged as pivotal in streamlining operations across various sectors. These tools are designed to automate the handling of spreadsheets, which are ubiquitous in business environments for tasks ranging from data entry to complex financial modeling. By reducing the manual effort involved in managing spreadsheets, these tools not only enhance accuracy but also free up valuable time for employees to focus on more strategic initiatives. The integration of these tools with existing systems can lead to significant improvements in productivity and data consistency, making them an essential component of modern data management strategies. As businesses continue to seek efficiency and precision in their operations, the adoption of spreadsheet automation tools is expected to rise, further driving the growth of the automation market.




    From a regional perspective, North America currently dominates the CSV Automation Tools market, accounting for the largest share of global revenue.

  7. CSV Editor Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). CSV Editor Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/csv-editor-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    CSV Editor Market Outlook



    According to our latest research, the global CSV Editor market size reached USD 1.24 billion in 2024, reflecting robust adoption across various industries. The market is projected to expand at a CAGR of 7.8% from 2025 to 2033, reaching an estimated value of USD 2.43 billion by 2033. This impressive growth trajectory is primarily driven by the rising demand for efficient data management tools, the proliferation of digital transformation initiatives, and the increasing reliance on structured data formats for analytics and business intelligence applications.




    One of the most significant growth factors for the CSV Editor market is the exponential increase in data generation across enterprises of all sizes. Organizations are increasingly leveraging CSV editors to manage, clean, and manipulate large datasets for analytics, reporting, and integration purposes. The surge in cloud computing adoption has further amplified the need for agile, scalable, and collaborative data editing solutions, making CSV editors an indispensable tool in the modern data stack. Furthermore, the integration of advanced features such as real-time collaboration, data validation, and seamless interoperability with other business applications has significantly enhanced the value proposition of contemporary CSV editors, driving their adoption across both technical and non-technical user segments.




    Another critical driver for the CSV Editor market is the growing emphasis on data-driven decision-making within enterprises. As organizations strive to extract actionable insights from vast volumes of structured and semi-structured data, the ability to efficiently manipulate and curate CSV files becomes paramount. CSV editors are increasingly being integrated with business intelligence platforms, data warehouses, and ETL (Extract, Transform, Load) pipelines, enabling users to streamline data preparation workflows and reduce time-to-insight. The emergence of low-code and no-code platforms has also democratized access to data editing tools, empowering business users to participate in data management processes without requiring extensive technical expertise.




    The rapid evolution of regulatory requirements concerning data privacy and governance is also fueling the demand for advanced CSV editors. Organizations must ensure that their data handling practices comply with regulations such as GDPR, HIPAA, and CCPA, which necessitates robust data auditing, validation, and version control capabilities. Modern CSV editors are being equipped with features that facilitate compliance, such as audit trails, role-based access controls, and automated data masking. As a result, industries with stringent compliance mandates, including BFSI, healthcare, and government, are increasingly adopting sophisticated CSV editing solutions to mitigate risks and ensure data integrity.




    From a regional perspective, North America continues to dominate the CSV Editor market owing to its mature IT infrastructure, high digital adoption rates, and a strong presence of leading software vendors. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digitalization, expanding SME sector, and increasing investments in cloud-based solutions. Europe also presents significant opportunities due to the region's focus on data protection and the proliferation of data-centric business models. Latin America and the Middle East & Africa regions are gradually catching up, supported by improving internet penetration and government-led digital initiatives.



    In the realm of data management tools, the role of a YAML Editor is becoming increasingly significant. As organizations continue to embrace DevOps practices and infrastructure as code, YAML files are frequently used to define configurations, automate processes, and manage application deployments. The simplicity and readability of YAML make it an ideal choice for configuration management, enabling developers and IT professionals to streamline their workflows and reduce the risk of errors. With the growing complexity of IT environments, a robust YAML Editor is essential for ensuring accuracy and consistency across configuration files, thereby enhancing operational efficiency and reducing downtime.




  8. CSV Editor Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). CSV Editor Market Research Report 2033 [Dataset]. https://dataintelo.com/report/csv-editor-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    CSV Editor Market Outlook



    According to our latest research, the global CSV Editor market size reached USD 1.32 billion in 2024, reflecting the growing integration of data-centric processes across numerous industries. The market is experiencing robust expansion, supported by a CAGR of 10.4% from 2025 to 2033. By the end of the forecast period in 2033, the CSV Editor market is anticipated to achieve a value of USD 3.16 billion. The primary growth factor driving this market is the increasing reliance on structured data formats for business intelligence, analytics, and process automation, which has led to a surge in demand for advanced CSV editing tools globally.




    The growth of the CSV Editor market is underpinned by the digital transformation initiatives adopted by enterprises across sectors such as BFSI, healthcare, IT, and retail. As organizations generate and handle exponentially larger volumes of data, the need to efficiently manage, clean, and manipulate CSV files has become crucial. CSV Editors, which enable users to modify, validate, and visualize large datasets, are now considered essential for data-driven decision-making. The proliferation of cloud computing and the rise of big data analytics have further accentuated the importance of robust CSV editing solutions, as businesses seek to streamline workflows and enhance productivity.




    Another significant growth driver is the increasing adoption of automation and artificial intelligence in data management processes. Modern CSV Editors are evolving from simple file manipulation tools to sophisticated platforms that support scripting, automation, and integration with other enterprise software. This evolution is particularly evident in industries such as healthcare and finance, where the accuracy and consistency of data are paramount. The availability of both on-premises and cloud-based deployment modes has also broadened the market’s appeal, catering to organizations with varying security and compliance requirements. Furthermore, the growing trend of remote work and distributed teams has fueled demand for web-based CSV Editors that facilitate real-time collaboration and seamless access from multiple devices.




    The CSV Editor market is also benefitting from the increasing focus on data governance and regulatory compliance. As governments and regulatory bodies implement stricter data protection laws, organizations are compelled to invest in tools that ensure data integrity and traceability. CSV Editors play a pivotal role in maintaining audit trails, validating data formats, and supporting compliance with standards such as GDPR and HIPAA. This regulatory backdrop, combined with the rise in cyber threats and data breaches, has made secure and feature-rich CSV Editors a necessity for enterprises seeking to mitigate risks and safeguard sensitive information.




    Regionally, North America dominates the CSV Editor market, accounting for the largest revenue share in 2024, driven by the presence of leading technology firms and widespread adoption of data management solutions. Europe follows closely, with strong demand from the BFSI and healthcare sectors. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digitization, expanding IT infrastructure, and increased investments in data analytics. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions gradually embrace digital transformation and modern data management practices.



    Component Analysis



    The CSV Editor market is segmented by component into Software and Services, with software solutions representing the lion's share of the market in 2024. The software segment encompasses standalone CSV editing applications, integrated development environments (IDEs), and plug-ins that facilitate the manipulation and validation of CSV files. These solutions are in high demand due to their ability to handle large datasets, support complex data transformations, and provide user-friendly interfaces for both technical and non-technical users. The continuous evolution of software features, such as real-time collaboration, version control, and advanced data visualization, is further propelling the adoption of CSV Editor software across industries.




    The services segment, while smaller in comparison, is gaining traction as organizations seek

  9. Wikipedia Biographies Text Generation Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
    Explore at:
    Available download formats: zip (269983242 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Biographies Text Generation Dataset

    Wikipedia Biographies: Infobox and First Paragraphs Texts

    By wiki_bio (From Huggingface) [source]

    About this dataset

    The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

    In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

    The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

    Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

    This extended description aims to provide an informative overview of the dataset's structure and its intended use cases in natural language processing tasks such as text generation and summarization. Researchers can leverage this comprehensive collection to advance applications such as automatic biography writing systems and content generation tasks that require coherent textual output based on partial information extracted from an infobox or an initial paragraph of an online encyclopedia such as Wikipedia.

    How to use the dataset

    • Overview:

      • This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
      • The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
      • Each file contains pairs of input text and target text.
    • File Descriptions:

      • train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
      • val.csv: Used for validation. It contains a collection of biographies with infobox and first paragraph texts.
      • test.csv: This file can be used to generate complete biographies based on the given input texts.
    • Column Information:

      a) For train.csv:

      • input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
      • target_text: Target text column containing the complete biography text for each entry.

      b) For val.csv:

      • input_text: Infobox and first paragraph texts are included in this column.
      • target_text: Complete biography texts are present in this column.

      c) For test.csv: The columns follow the same pattern, i.e., input_text followed by target_text.

    • Usage Guidelines:

    • Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

    • Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

    • Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

    • Additional Information and Tips:

    • The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

    • The target text is the complete biography for each entry.

    • While working with this dataset, make sure to preprocess and
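The two-column train/val/test layout described above can be exercised with nothing but the standard library. A minimal sketch, using a toy in-memory row in place of the real train.csv (the row content is invented purely for illustration):

```python
import csv
import io

# Toy stand-in for train.csv: the input_text / target_text columns described
# above. The example row is invented; real rows are much longer.
toy_train_csv = io.StringIO(
    "input_text,target_text\n"
    '"name: ada lovelace || born: 1815. Ada Lovelace was a mathematician.",'
    '"Ada Lovelace was an English mathematician and writer."\n'
)

pairs = [(r["input_text"], r["target_text"]) for r in csv.DictReader(toy_train_csv)]
for src, tgt in pairs:
    print(f"{len(src.split())} source tokens -> {len(tgt.split())} target tokens")
```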

    Research Ideas

    • Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
  10. h

    windows-event-codes-qanda

    • huggingface.co
    Updated Aug 8, 2024
    Cite
    whit3rabbit (2024). windows-event-codes-qanda [Dataset]. https://huggingface.co/datasets/cowWhySo/windows-event-codes-qanda
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 8, 2024
    Authors
    whit3rabbit
    License

    https://choosealicense.com/licenses/unlicense/

    Description

    I used my notebook here to generate CSV files for Windows event codes: https://github.com/whit3rabbit/Windows-Event-Codes-CSV. The CSV is here: https://github.com/whit3rabbit/Windows-Event-Codes-CSV/blob/main/updated_detailed_events.csv. I converted each line to markdown and used it to generate questions and answers. These have not been vetted for accuracy, so use them with caution.

  11. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3949757
    Explore at:
    Available download formats: csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data wrangling efforts to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find 6 CSV files:

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

    Capacities

    The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51
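The per-sheet cell ranges listed in these sections can be written down directly as data and applied with a small slicing helper. A minimal pure-Python sketch (the toy grid stands in for a parsed PEMM sheet; the 1-based inclusive ranges mirror the ones documented above, and the helper name is ours):

```python
# Map each output CSV to (sheet name, row range, column range) in the PEMM
# Excel files, using the 1-based inclusive ranges documented above.
EXTRACTION = {
    "PECD-hydro-weekly-inflows.csv": [
        ("Reservoir", (13, 66), (16, 51)),
        ("Pump storage - Open Loop", (13, 66), (16, 51)),
    ],
}

def slice_block(grid, row_range, col_range):
    """Cut a 1-based inclusive rectangular block out of a list-of-lists sheet."""
    r0, r1 = row_range
    c0, c1 = col_range
    return [row[c0 - 1:c1] for row in grid[r0 - 1:r1]]

# Toy 3x4 "sheet" to demonstrate the indexing convention.
toy = [[f"r{r}c{c}" for c in range(1, 5)] for r in range(1, 4)]
print(slice_block(toy, (2, 3), (2, 4)))  # rows 2-3, columns 2-4
```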

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  12. m

    KU-MG2: A Dataset for Hybrid Photovoltaic-Natural Gas Generator Microgrid...

    • data.mendeley.com
    • search.datacite.org
    Updated Jul 28, 2020
    Cite
    Abdullah-Al Nahid (2020). KU-MG2: A Dataset for Hybrid Photovoltaic-Natural Gas Generator Microgrid Model of a Residential Area. (For Padma residential area, Rajshahi, Bangladesh) [Dataset]. http://doi.org/10.17632/js5mtkf5yk.1
    Explore at:
    Dataset updated
    Jul 28, 2020
    Authors
    Abdullah-Al Nahid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh, Rajshahi, Padma Residential Area
    Description

    A renewable energy resource-based sustainable microgrid model for a residential area is designed with the HOMER PRO microgrid software. A small residential area of 20 buildings housing about 60 families, with an annual energy consumption of 219 MWh, plus an electric vehicle charging station serving 10 batteries daily, with an annual energy consumption of 18.3 MWh, in the Padma residential area, Rajshahi (24°22.6'N, 88°37.2'E), is selected as the case study. Solar panels, a natural gas generator, an inverter and Li-ion batteries are required for the proposed model, and HOMER PRO is used to optimize it. Data were collected from HOMER PRO for the year 2007. We compared our daily load demand of 650 kW with the results obtained by varying the load by 10%, 5%, and 2.5% more and less, to find the best case for our demand. In total there are 7 datasets for the different load conditions; each contains 8760 sets of data with 6 parameters per set. Data file contents:

    Data 1:: original_load.csv: This file contains data for the 650 kW load demand (8760 sets of data, 6 parameters per set). Data arrangement:

    • Column 1: Date and time of data recording in the format MM-DD-YYYY [hh]:[mm] (24-hour time).
    • Column 2: Solar power output in kW.
    • Column 3: Generator power output in kW.
    • Column 4: Total electrical load served in kW.
    • Column 5: Excess electrical production in kW.
    • Column 6: Li-ion battery energy content in kWh.
    • Column 7: Li-ion battery state of charge in %.

    Data 2:: 2.5%_more_load.csv: This file contains data for 677KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.

    Data 3:: 2.5%_less_load.csv: This file contains data for 622KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.

    Data 4:: 5%_more_load.csv: This file contains data for 705KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 5:: 5%_less_load.csv: This file contains data for 595KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 6:: 10%_more_load.csv: This file contains data for the 760KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset. Data 7:: 10%_less_load.csv: This file contains data for 540KW load demand. The dataset contains a total of 8760 sets of data having 6 different parameters for each set. Column information is the same for every dataset.

  13. Z

    TRAVEL: A Dataset with Toolchains for Test Generation and Regression Testing...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jul 17, 2024
    Cite
    Pouria Derakhshanfar; Annibale Panichella; Alessio Gambi; Vincenzo Riccio; Christian Birchler; Sebastiano Panichella (2024). TRAVEL: A Dataset with Toolchains for Test Generation and Regression Testing of Self-driving Cars Software [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5911160
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Università della Svizzera Italiana
    University of Passau
    Zurich University of Applied Sciences
    Delft University of Technology
    Authors
    Pouria Derakhshanfar; Annibale Panichella; Alessio Gambi; Vincenzo Riccio; Christian Birchler; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository hosts the Testing Roads for Autonomous VEhicLes (TRAVEL) dataset. TRAVEL is an extensive collection of virtual roads that have been used for testing lane assist/keeping systems (i.e., driving agents), together with data from their execution in BeamNG.tech, a state-of-the-art, physically accurate driving simulator. Virtual roads consist of sequences of road points interpolated using cubic splines.

    Along with the data, this repository contains instructions on how to install the tooling necessary to generate new data (i.e., test cases) and analyze them in the context of test regression. We focus on test selection and test prioritization, given their importance for developing high-quality software following the DevOps paradigms.

    This dataset builds on top of our previous work in this area, including work on

    • test generation (e.g., AsFault, DeepJanus, and DeepHyperion) and the SBST CPS tool competition (SBST2021);

    • test selection: SDC-Scissor and the related tool;

    • test prioritization: automated test case prioritization for SDCs.

    Dataset Overview

    The TRAVEL dataset is available under the data folder and is organized as a set of experiments folders. Each of these folders is generated by running the test-generator (see below) and contains the configuration used for generating the data (experiment_description.csv), various statistics on generated tests (generation_stats.csv) and found faults (oob_stats.csv). Additionally, the folders contain the raw test cases generated and executed during each experiment (test..json).

    The following sections describe what each of those files contains.

    Experiment Description

    The experiment_description.csv contains the settings used to generate the data, including:

    Time budget. The overall generation budget in hours. This budget includes both the time to generate and execute the tests as driving simulations.

    The size of the map. The size of the squared map defines the boundaries inside which the virtual roads develop in meters.

    The test subject. The driving agent that implements the lane-keeping system under test. The TRAVEL dataset contains data generated testing the BeamNG.AI and the end-to-end Dave2 systems.

    The test generator. The algorithm that generated the test cases. The TRAVEL dataset contains data obtained using various algorithms, ranging from naive and advanced random generators to complex evolutionary algorithms, for generating tests.

    The speed limit. The maximum speed at which the driving agent under test can travel.

    Out of Bound (OOB) tolerance. The test cases' oracle that defines the tolerable amount of the ego-car that can lie outside the lane boundaries. This parameter ranges between 0.0 and 1.0. In the former case, a test failure triggers as soon as any part of the ego-vehicle goes out of the lane boundary; in the latter case, a test failure triggers only if the entire body of the ego-car falls outside the lane.

    Experiment Statistics

    The generation_stats.csv contains statistics about the test generation, including:

    Total number of generated tests. The number of tests generated during an experiment. This number is broken down into the number of valid tests and invalid tests. Valid tests contain virtual roads that do not self-intersect and contain turns that are not too sharp.

    Test outcome. The test outcome contains the number of passed tests, failed tests, and tests in error. Passed and failed tests are defined by the OOB tolerance and an additional (implicit) oracle that checks whether the ego-car is moving or standing. Tests that did not pass because of other errors (e.g., the simulator crashed) are reported in a separate category.

    The TRAVEL dataset also contains statistics about the failed tests, including the overall number of failed tests (total oob) and its breakdown into OOB that happened while driving left or right. Further statistics about the diversity (i.e., sparseness) of the failures are also reported.

    Test Cases and Executions

    Each test..json contains information about a test case and, if the test case is valid, the data observed during its execution as driving simulation.

    The data about the test case definition include:

    The road points. The list of points in a 2D space that identifies the center of the virtual road, and their interpolation using cubic splines (interpolated_points)

    The test ID. The unique identifier of the test in the experiment.

    Validity flag and explanation. A flag that indicates whether the test is valid or not, and a brief message describing why the test is not considered valid (e.g., the road contains sharp turns or the road self intersects)

    The test data are organized according to the following JSON Schema and can be interpreted as RoadTest objects provided by the tests_generation.py module.

    {
      "type": "object",
      "properties": {
        "id": { "type": "integer" },
        "is_valid": { "type": "boolean" },
        "validation_message": { "type": "string" },
        "road_points": { "type": "array", "items": { "$ref": "schemas/pair" } },
        "interpolated_points": { "type": "array", "items": { "$ref": "schemas/pair" } },
        "test_outcome": { "type": "string" },
        "description": { "type": "string" },
        "execution_data": { "type": "array", "items": { "$ref": "schemas/simulationdata" } }
      },
      "required": [ "id", "is_valid", "validation_message", "road_points", "interpolated_points" ]
    }

    Finally, the execution data contain a list of timestamped state information recorded by the driving simulation. State information is collected at constant frequency and includes absolute position, rotation, and velocity of the ego-car, its speed in Km/h, and control inputs from the driving agent (steering, throttle, and braking). Additionally, execution data contain OOB-related data, such as the lateral distance between the car and the lane center and the OOB percentage (i.e., how much the car is outside the lane).

    The simulation data adhere to the following (simplified) JSON Schema and can be interpreted as Python objects using the simulation_data.py module.

    {
      "$id": "schemas/simulationdata",
      "type": "object",
      "properties": {
        "timer": { "type": "number" },
        "pos": { "type": "array", "items": { "$ref": "schemas/triple" } },
        "vel": { "type": "array", "items": { "$ref": "schemas/triple" } },
        "vel_kmh": { "type": "number" },
        "steering": { "type": "number" },
        "brake": { "type": "number" },
        "throttle": { "type": "number" },
        "is_oob": { "type": "number" },
        "oob_percentage": { "type": "number" }
      },
      "required": [ "timer", "pos", "vel", "vel_kmh", "steering", "brake", "throttle", "is_oob", "oob_percentage" ]
    }
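A test file shaped like these two schemas can be consumed with the standard json module. A minimal sketch over a hand-written toy record (only the field names come from the schemas; every value is invented):

```python
import json

# Toy record shaped like a valid RoadTest with two execution-data samples.
raw = json.dumps({
    "id": 1,
    "is_valid": True,
    "validation_message": "",
    "road_points": [[0.0, 0.0], [50.0, 10.0]],
    "interpolated_points": [[0.0, 0.0], [25.0, 5.0], [50.0, 10.0]],
    "test_outcome": "FAIL",
    "execution_data": [
        {"timer": 0.1, "vel_kmh": 62.0, "is_oob": 0, "oob_percentage": 0.0},
        {"timer": 0.2, "vel_kmh": 64.5, "is_oob": 1, "oob_percentage": 0.97},
    ],
})

test = json.loads(raw)
# Pull the worst out-of-bound moment from the recorded simulation states.
peak_oob = max(s["oob_percentage"] for s in test["execution_data"])
print(test["test_outcome"], "peak OOB:", peak_oob)
```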

    Dataset Content

    The TRAVEL dataset is a lively initiative so the content of the dataset is subject to change. Currently, the dataset contains the data collected during the SBST CPS tool competition, and data collected in the context of our recent work on test selection (SDC-Scissor work and tool) and test prioritization (automated test cases prioritization work for SDCs).

    SBST CPS Tool Competition Data

    The data collected during the SBST CPS tool competition are stored inside data/competition.tar.gz. The file contains the test cases generated by Deeper, Frenetic, AdaFrenetic, and Swat, the open-source test generators submitted to the competition and executed against BeamNG.AI with an aggression factor of 0.7 (i.e., conservative driver).

        Name      Map Size (m x m)   Max Speed (Km/h)   Budget (h)      OOB Tolerance (%)   Test Subject
        DEFAULT   200 × 200          120                5 (real time)   0.95                BeamNG.AI - 0.7
        SBST      200 × 200          70                 2 (real time)   0.5                 BeamNG.AI - 0.7

    Specifically, the TRAVEL dataset contains 8 repetitions for each of the above configurations for each test generator totaling 64 experiments.

    SDC Scissor

    With SDC-Scissor we collected data based on the Frenetic test generator. The data is stored inside data/sdc-scissor.tar.gz. The following table summarizes the used parameters.

        Name          Map Size (m x m)   Max Speed (Km/h)   Budget (h)       OOB Tolerance (%)   Test Subject
        SDC-SCISSOR   200 × 200          120                16 (real time)   0.5                 BeamNG.AI - 1.5

    The dataset contains 9 experiments with the above configuration. For generating your own data with SDC-Scissor follow the instructions in its repository.

    Dataset Statistics

    Here is an overview of the TRAVEL dataset: generated tests, executed tests, and faults found by all the test generators, grouped by experiment configuration. Some 25,845 test cases were generated by running 4 test generators 8 times in 2 configurations using the SBST CPS Tool Competition code pipeline (SBST in the table). We ran the test generators for 5 hours, allowing the ego-car a generous speed limit (120 Km/h) and defining a high OOB tolerance (i.e., 0.95); we also ran the test generators with a smaller generation budget (i.e., 2 hours) and speed limit (i.e., 70 Km/h) while setting the OOB tolerance to a lower value (i.e., 0.85). We also collected some 5,971 additional tests with SDC-Scissor (SDC-Scissor in the table) by running it 9 times for 16 hours using Frenetic as the test generator and defining a more realistic OOB tolerance (i.e., 0.50).

    Generating new Data

    Generating new data, i.e., test cases, can be done using the SBST CPS Tool Competition pipeline and the driving simulator BeamNG.tech.

    Extensive instructions on how to install both are reported in the SBST CPS Tool Competition pipeline documentation.

  14. h

    Data from: playlist-generator

    • huggingface.co
    Updated Jun 29, 2022
    Cite
    Nima Boscarino (2022). playlist-generator [Dataset]. https://huggingface.co/datasets/NimaBoscarino/playlist-generator
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2022
    Authors
    Nima Boscarino
    Description

    Playlist Generator Dataset

    This dataset contains three files, used in the Playlist Generator space. Visit the blog post to learn more about the project: https://huggingface.co/blog/your-first-ml-project

    verse-embeddings.pkl contains Sentence Transformer embeddings for each verse for each song in a private (unreleased) dataset of song lyrics. The embeddings were generated using this model: https://huggingface.co/sentence-transformers/msmarco-MiniLM-L-6-v3 verses.csv maps each verse… See the full description on the dataset page: https://huggingface.co/datasets/NimaBoscarino/playlist-generator.
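Retrieval in the playlist space boils down to cosine similarity between a query embedding and each stored verse embedding. A minimal sketch with toy 3-dimensional vectors (real Sentence Transformer embeddings are much higher-dimensional; the vector values and verse names here are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for a query embedding and two verse embeddings.
query = [0.9, 0.1, 0.0]
verses = {"verse_a": [0.8, 0.2, 0.1], "verse_b": [0.0, 0.1, 0.9]}

# Rank verses by similarity to the query, as a retrieval step would.
best = max(verses, key=lambda name: cosine(query, verses[name]))
print(best)
```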

  15. Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories

    • catalog.data.gov
    • data.transportation.gov
    • +2more
    Updated Sep 30, 2025
    + more versions
    Cite
    Federal Highway Administration (2025). Third Generation Simulation Data (TGSIM) I-294 L1 Trajectories [Dataset]. https://catalog.data.gov/dataset/third-generation-simulation-data-tgsim-i-294-l1-trajectories
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Area covered
    Interstate 294
    Description

    The main dataset is a 70 MB file of trajectory data (I294_L1_final.csv) that contains position, speed, and acceleration data for small and large automated (L1) vehicles and non-automated vehicles on a highway in a suburban environment. Supporting files include aerial reference images for ten distinct data collection “Runs” (I294_L1_RunX_with_lanes.png, where X equals 8, 18, and 20 for southbound runs and 1, 3, 7, 9, 11, 19, and 21 for northbound runs). Associated centerline files are also provided for each “Run” (I-294-L1-Run_X-geometry-with-ramps.csv). In each centerline file, x and y coordinates (in meters) marking each lane centerline are provided; the origin point of the reference image is located at the top left corner. Additionally, each centerline file uses an indicator variable for each lane to define the following types of road sections: 0=no ramp, 1=on-ramp, 2=off-ramp, and 3=weaving segment. The number attached to each column header is the numerical ID assigned to the specific lane (see “TGSIM – Centerline Data Dictionary – I294 L1.csv” for more details). The dataset defines eight lanes (four lanes in each direction) using these centerline files. Images that map the lanes of interest to the numerical lane IDs referenced in the trajectory dataset are stored in the folder titled “Annotation on Regions.zip”. The southbound lanes are shown visually in I294_L1_Lane-2.png through I294_L1_Lane-5.png and the northbound lanes in I294_L1_Lane2.png through I294_L1_Lane5.png.

    This dataset was collected as part of the Third Generation Simulation Data (TGSIM): A Closer Look at the Impacts of Automated Driving Systems on Human Behavior project, during which six trajectory datasets capable of characterizing human-automated vehicle interactions under a diverse set of scenarios in highway and city environments were collected and processed. For more information, see the project report found here: https://rosap.ntl.bts.gov/view/dot/74647.

    This dataset, one of the six collected as part of the TGSIM project, contains data collected using one high-resolution 8K camera mounted on a helicopter that followed three SAE Level 1 ADAS-equipped vehicles with adaptive cruise control (ACC) enabled. The three vehicles manually entered the highway, moved to the second-from-leftmost lane, then enabled ACC with minimum following distance settings to initiate a string. The helicopter then followed the string of vehicles (which sometimes broke from the string due to large following distances) northbound through the 4.8 km section of highway at an altitude of 300 meters. The goal of the data collection effort was to collect data related to human drivers' responses to vehicle strings. The road segment has four lanes in each direction and covers a major on-ramp and an off-ramp in the southbound direction and one on-ramp in the northbound direction. The segment of highway is operated by Illinois Tollway and contains a high percentage of heavy vehicles. The camera captured footage during the evening rush hour (3:00 PM-5:00 PM CT) on a sunny day.

    As part of this dataset, the following files were provided:

    I294_L1_final.csv contains the numerical data to be used for analysis, including vehicle-level trajectory data at every 0.1 second. Vehicle size (small or large), width, length, and whether the vehicle was one of the test vehicles with ACC engaged ("yes" or "no") are provided with instantaneous location, speed, and acceleration data. All distance measurements (width, length, location) were converted from pixels to meters using the conversion factor 1 pixel = 0.3 meters.

    I294_L1_RunX_with_lanes.png are the aerial reference images that define the geographic region and associated roadway segments of interest (see bounding boxes on northbound and southbound lanes) for each run X.

    I-294-L1-Run_X-geometry-with-ramps.csv contain the coordinates that define the lane centerlines for each run X, as described above.
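The documented conventions above (the 1 pixel = 0.3 m conversion and the integer section codes in the centerline files) can be illustrated with a minimal sketch. This is a hypothetical helper, not part of the dataset's tooling; the function names are assumptions.

```python
# Sketch of the documented conventions: distances in the trajectory file were
# converted at 1 pixel = 0.3 meters, and centerline files flag road sections
# with an integer code per lane. Helper names here are illustrative only.

PIXELS_TO_METERS = 0.3  # conversion factor stated in the dataset description

SECTION_TYPES = {
    0: "no ramp",
    1: "on-ramp",
    2: "off-ramp",
    3: "weaving segment",
}

def pixels_to_meters(value_px: float) -> float:
    """Convert a pixel measurement (width, length, position) to meters."""
    return value_px * PIXELS_TO_METERS

def describe_section(code: int) -> str:
    """Translate a centerline indicator value into a readable label."""
    return SECTION_TYPES.get(code, "unknown")
```

For example, a vehicle length recorded as 15 pixels in the source footage corresponds to 4.5 meters after conversion.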

  16. Renewable Energy Generation Amount (kWh) by Renewable Energy Type |...

    • data.gov.hk
    Updated Nov 24, 2025
    + more versions
    Cite
    data.gov.hk (2025). Renewable Energy Generation Amount (kWh) by Renewable Energy Type | DATA.GOV.HK [Dataset]. https://data.gov.hk/en-data/dataset/hkelectric-cs_cbd-renewable-energy-generation-by-renewable-energy-type
    Explore at:
    Dataset updated
    Nov 24, 2025
    Dataset provided by
    data.gov.hk
    Description

    Provides the renewable energy generation amounts by renewable energy system type. The CSV file contains the renewable energy generation amounts from solar photovoltaic systems and wind power systems respectively.

  17. Replication Package of the paper "Large Language Models for Multilingual...

    • zenodo.org
    zip
    Updated Oct 3, 2025
    Cite
    Alessandro Midolo; Saima Afrin; Camilo Escobar-Velásquez; Mario Linares-Vasquez; Ding Weiyuan; Bowen Xu; Massimiliano Di Penta; Antonio Mastropaolo (2025). Replication Package of the paper "Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality" [Dataset]. http://doi.org/10.5281/zenodo.17259178
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 3, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alessandro Midolo; Saima Afrin; Camilo Escobar-Velásquez; Mario Linares-Vasquez; Ding Weiyuan; Bowen Xu; Massimiliano Di Penta; Antonio Mastropaolo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality

    Abstract

    Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and in the language used to formulate the prompt.

    Replication Package

    This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.

    Data

    The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:

    1. prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The original English prompt.
      • summary: The original summary.
      • code: The original code.
      • translation: The translation generated by GPT.
      • correction: The manual correction of the GPT-generated translation.
      • correction_tag: A list of tags indicating the corrections made to the translation.
      • generated_code: This column is initially empty and will contain the code generated from the translated prompt.
    2. generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:

      • files: The files with the generated code (named by the query ID).
      • report: Reports generated by static analysis tools.
      • A CSV file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.
    3. tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

    4. quantitative_analysis: Contains all the CSV reports of the static analysis tools and the test output for all languages and models. These files are the inputs for the statistical analysis. The stats directory contains all the output tables for the statistical analysis, which are shown in the paper's tables.

    5. qualitative_analysis: Contains files used for the qualitative analysis:

      • CohenKappaagreement.csv: A file containing the subset used to compute Cohen's kappa metrics for manual analysis.
      • files: Contains all files for the qualitative analysis. Each file has the following columns:
        • id: The ID of the query in the CoderEval benchmark.
        • generated_code: The code generated by the model.
        • comments: The language used for comments.
        • identifiers: The language used for identifiers.
        • literals: The language used for literals.
        • notes: Additional notes.
    6. ablation_study: Contains files for the ablation study. Each file has the following columns:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The prompt used for code generation.
      • generated_code, comments, identifiers, and literals: Same as in the qualitative analysis. results.pdf: This file shows the table containing all the percentages of comments, identifiers and literals extracted from the csv files of the ablation study.

      Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:

    You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
    Use a Python code block to write your response.
    Comments and identifiers must be in Italian. 
    For example:
    ```python
    print("Hello World!")
    ```

    Scripts

    The scripts directory contains all the scripts used to perform the generations and analyses. All files are properly commented. Here is a brief description of each file:

    • code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.

    • computeallanalysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.

    • createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.

    • deepseek_model.py: This function sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.

    • extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.

    • flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.

    • generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.

    • generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are JSON-serializable before sending them to the Claude API.
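The report step described for flake_analysis.py — turning Flake8 output into a CSV of filenames, error codes, and messages — can be sketched as follows. This is a minimal illustration, not the authors' script; it assumes Flake8's default `file:line:col: CODE message` output format.

```python
import csv
import io
import re

# Flake8's default output lines look like: "path.py:3:1: F401 'os' imported but unused"
LINE_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<code>\S+) (?P<msg>.*)$")

def flake8_to_rows(output: str):
    """Parse Flake8 text output into (filename, line, code, message) tuples."""
    rows = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((match["file"], match["line"], match["code"], match["msg"]))
    return rows

def rows_to_csv(rows) -> str:
    """Serialize parsed rows into a CSV report with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["filename", "line", "error_code", "message"])
    writer.writerows(rows)
    return buf.getvalue()
```

Grouping the resulting rows by filename, as the replication package's report step does, is then a straightforward sort or dictionary pass over the tuples.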

  18. File S1 - Mynodbcsv: Lightweight Zero-Config Database Solution for Handling...

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Stanisław Adaszewski (2023). File S1 - Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files [Dataset]. http://doi.org/10.1371/journal.pone.0103319.s001
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Stanisław Adaszewski
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Set of Python scripts to generate data for benchmarks: equivalents of ADNI_Clin_6800_geno.csv, PTDEMOG.csv, MicroarrayExpression_fixed.csv and Probes.csv files, the dummy.csv, dummy2.csv and the microbenchmark CSV files. (ZIP)

  19. Canopy reflectance spectra and photographs (raw data), Seward Peninsula,...

    • search.dataone.org
    • osti.gov
    Updated Jul 10, 2024
    Cite
    Dedi Yang; Wouter Hanston; Kenneth Davidson; Shawn Serbin (2024). Canopy reflectance spectra and photographs (raw data), Seward Peninsula, Alaska, 2022 [Dataset]. http://doi.org/10.15485/2395958
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    ESS-DIVE
    Authors
    Dedi Yang; Wouter Hanston; Kenneth Davidson; Shawn Serbin
    Time period covered
    Jul 12, 2022 - Jul 28, 2022
    Area covered
    Description

    Measurements of full-range (350–2500 nm) canopy spectral reflectance of Arctic plant species, plots, and transects at the Next Generation Ecosystem Experiment Arctic (NGEE Arctic) Teller Mile Marker (MM27) and Kougarok Fire Complex (KFC) sites, Seward Peninsula, Alaska. Spectra were collected in July 2022 using a handheld SVC HR-2014i spectroradiometer. All spectra were collected as calibrated surface radiance and converted to surface reflectance using a 99.99% reflective Spectralon white reference standard. This data package includes unprocessed instrument output of the spectra signals (.sig) and, for some canopy measurements, photographs of the target taken by the SVC instrument camera or a handheld digital camera (.jpg), GPS locations, and file metadata (.csv). The Next-Generation Ecosystem Experiments: Arctic (NGEE Arctic) was a 15-year research effort (2012-2027) to reduce uncertainty in Earth System Models by developing a predictive understanding of carbon-rich Arctic ecosystems and feedbacks to climate. NGEE Arctic was supported by the Department of Energy's Office of Biological and Environmental Research. The NGEE Arctic project had two field research sites: 1) located within the Arctic polygonal tundra coastal region on the Barrow Environmental Observatory (BEO) and the North Slope near Utqiagvik (Barrow), Alaska and 2) multiple areas on the discontinuous permafrost region of the Seward Peninsula north of Nome, Alaska. Through observations, experiments, and synthesis with existing datasets, NGEE Arctic provided an enhanced knowledge base for multi-scale modeling and contributed to improved process representation at global pan-Arctic scales within the Department of Energy's Earth System Model (the Energy Exascale Earth System Model, or E3SM), and specifically within the E3SM Land Model component (ELM).
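The radiance-to-reflectance conversion described above (target radiance divided by the white-reference radiance, scaled by the panel's 99.99% reflectance) is a standard calculation and can be sketched as follows. The band values in the example are made up for illustration, not NGEE Arctic data.

```python
# Sketch of the standard radiance-to-reflectance conversion described above:
# reflectance = (target radiance / white-reference radiance) * panel reflectance.

PANEL_REFLECTANCE = 0.9999  # 99.99% reflective Spectralon white reference

def to_reflectance(target_radiance, reference_radiance):
    """Per-band surface reflectance from calibrated radiance spectra."""
    return [
        (target / reference) * PANEL_REFLECTANCE
        for target, reference in zip(target_radiance, reference_radiance)
    ]
```

In practice the two input spectra would be read from paired .sig files for the target and the white reference, measured close together in time.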

  20. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials were quickly absorbed into the body of the material, such as wood and concrete. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

    • Drying2022.csv: drying rate data for the 2022 experimental run
    • Weather2022.csv: weather data for the 2022 experimental run
    • Drying2023.csv: drying rate data for the 2023 experimental run
    • Weather2023.csv: weather data for the 2023 experimental run
    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    • disinfestant_drying_analysis.html: rendered output of notebook
    • MS_figures.R: additional R code to create figures formatted for journal requirements
    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    • data_dictionary.xlsx: descriptions of each column in the CSV data files
