100+ datasets found
  1. example-generate-preference-dataset

    • huggingface.co
    Updated Aug 23, 2024
    Cite
    distilabel-internal-testing (2024). example-generate-preference-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    Dataset Card for example-preference-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.

  2. Data generation volume worldwide 2010-2029

    • statista.com
    Updated Nov 19, 2025
    Cite
    Statista (2025). Data generation volume worldwide 2010-2029 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.

  3. SVG Code Generation Sample Training Data

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Available download formats: zip (193477 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLMs' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset is generated in two steps using the GPT-4o model.

    • In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    • In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model for the SVG code.
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, Visual question anwering and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
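
    The filtering step described above can be sketched as follows. This is a minimal illustration, not the competition's sanitization class: it only checks that an SVG parses and uses the element and attribute names listed in the prompt, and it assumes the SigLIP similarity score has already been computed elsewhere and is passed in as a number. The function names (`uses_only_allowed`, `keep_sample`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Allowed names copied from the generation prompt above.
ALLOWED_ELEMENTS = {
    "svg", "path", "circle", "rect", "ellipse", "line", "polyline",
    "polygon", "g", "linearGradient", "radialGradient", "stop", "defs",
}
ALLOWED_ATTRIBUTES = {
    "viewBox", "width", "height", "fill", "stroke", "stroke-width", "d",
    "cx", "cy", "r", "x", "y", "rx", "ry", "x1", "y1", "x2", "y2",
    "points", "transform", "opacity",
}
SCORE_THRESHOLD = 0.5  # the description keeps only scores above 0.5


def local_name(name: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.w3.org/2000/svg}rect'."""
    return name.rsplit("}", 1)[-1]


def uses_only_allowed(svg_code: str) -> bool:
    """Return True if the SVG is well-formed and uses only allowed names."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    for element in root.iter():
        if local_name(element.tag) not in ALLOWED_ELEMENTS:
            return False
        if any(local_name(attr) not in ALLOWED_ATTRIBUTES for attr in element.attrib):
            return False
    return True


def keep_sample(svg_code: str, score: float) -> bool:
    """Hypothetical filter: structurally valid and scored above the threshold."""
    return uses_only_allowed(svg_code) and score > SCORE_THRESHOLD
```

    A sample would be kept only when both conditions hold, for example `keep_sample(svg_code, score)` with a score produced by whichever similarity model is in use.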

    A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation

  4. Data used by EPA researchers to generate illustrative figures for overview...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 14, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data used by EPA researchers to generate illustrative figures for overview article "Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management" [Dataset]. https://catalog.data.gov/dataset/data-used-by-epa-researchers-to-generate-illustrative-figures-for-overview-article-multisc
    Dataset updated
    Nov 14, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”

    Overview

    The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS.

    The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31 averaged over all grid cells located in Colorado were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources on ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article.

    The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

    CMAQ Model Data

    The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found in the CMAQ documentation on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

    This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).

  5. Test Data Generation AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Test Data Generation AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/test-data-generation-ai-market
    Available download formats: pptx, pdf, csv
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Test Data Generation AI Market Outlook



    According to our latest research, the global Test Data Generation AI market size reached USD 1.29 billion in 2024 and is projected to grow at a robust CAGR of 24.7% from 2025 to 2033. By the end of the forecast period in 2033, the market is anticipated to attain a value of USD 10.1 billion. This substantial growth is primarily driven by the increasing complexity of software systems, the rising need for high-quality, compliant test data, and the rapid adoption of AI-driven automation across diverse industries.
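
    The compound annual growth rate (CAGR) arithmetic behind a projection like this can be sketched as below. The function names are hypothetical, and the number of compounding periods (nine, counting from the 2024 base to 2033) is an assumption; the report does not state its base-year convention, so the result need not reproduce the quoted 2033 figure exactly.

```python
def project_value(base_value: float, annual_rate: float, years: int) -> float:
    """Compound base_value at annual_rate for the given number of years."""
    return base_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Annual growth rate that turns start_value into end_value over `years`."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the report: USD 1.29 billion in 2024, CAGR of 24.7%.
# Assumption: nine compounding periods (2024 -> 2033).
projected_2033 = project_value(1.29, 0.247, 9)
print(f"Projected 2033 value: USD {projected_2033:.2f} billion")

# The rate implied by the two quoted endpoints, under the same assumption.
print(f"Implied CAGR: {implied_cagr(1.29, 10.1, 9):.1%}")
```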



    The accelerating digital transformation across sectors such as BFSI, healthcare, and retail is one of the core growth factors propelling the Test Data Generation AI market. Organizations are under mounting pressure to deliver software faster, with higher quality and reduced risk, especially as business models become more data-driven and customer expectations for seamless digital experiences intensify. AI-powered test data generation tools are proving indispensable by automating the creation of realistic, diverse, and compliant test datasets, thereby enabling faster and more reliable software testing cycles. Furthermore, the proliferation of agile and DevOps practices is amplifying the demand for continuous testing environments, where the ability to generate synthetic test data on demand is a critical enabler of speed and innovation.



    Another significant driver is the escalating emphasis on data privacy, security, and regulatory compliance. With stringent regulations such as GDPR, HIPAA, and CCPA in place, enterprises are compelled to ensure that non-production environments do not expose sensitive information. Test Data Generation AI solutions excel at creating anonymized or masked data sets that maintain the statistical properties of production data while eliminating privacy risks. This capability not only addresses compliance mandates but also empowers organizations to safely test new features, integrations, and applications without compromising user confidentiality. The growing awareness of these compliance imperatives is expected to further accelerate the adoption of AI-driven test data generation tools across regulated industries.



    The ongoing evolution of AI and machine learning technologies is also enhancing the capabilities and appeal of Test Data Generation AI solutions. Advanced algorithms can now analyze complex data models, understand interdependencies, and generate highly realistic test data that mirrors production environments. This sophistication enables organizations to uncover hidden defects, improve test coverage, and simulate edge cases that would be challenging to create manually. As AI models continue to mature, the accuracy, scalability, and adaptability of test data generation platforms are expected to reach new heights, making them a strategic asset for enterprises striving for digital excellence and operational resilience.



    Regionally, North America continues to dominate the Test Data Generation AI market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The United States, in particular, is at the forefront due to its advanced technology ecosystem, early adoption of AI solutions, and the presence of leading software and cloud service providers. However, Asia Pacific is emerging as a high-growth region, fueled by rapid digitalization, expanding IT infrastructure, and increasing investments in AI research and development. Europe remains a key market, underpinned by strong regulatory frameworks and a growing focus on data privacy. Latin America and the Middle East & Africa, while still nascent, are exhibiting steady growth as enterprises in these regions recognize the value of AI-driven test data solutions for competitive differentiation and compliance assurance.



    Component Analysis



    The Test Data Generation AI market by component is segmented into Software and Services, each playing a pivotal role in driving the overall market expansion. The software segment commands the lion’s share of the market, as organizations increasingly prioritize automation and scalability in their test data generation processes. AI-powered software platforms offer a suite of features, including data profiling, masking, subsetting, and synthetic data creation, which are integral to modern DevOps and continuous integration/continuous deployment (CI/CD) pipelines. These platforms are designed to seamlessly integrate with existing testing tools, datab

  6. This file contains the source data used to generate every figure (main and...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 16, 2020
    Cite
    Martin, Brigitte E.; Sun, Jiayi; Harris, Jeremy D.; Brooke, Christopher B.; Koelle, Katia (2020). This file contains the source data used to generate every figure (main and supplemental) in this manuscript. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000483243
    Dataset updated
    Oct 16, 2020
    Authors
    Martin, Brigitte E.; Sun, Jiayi; Harris, Jeremy D.; Brooke, Christopher B.; Koelle, Katia
    Description

    Each tab of the Excel file includes figure panels, which are often grouped according to cell line, MDCK or A549 cells. For data that was used in multiple figures, we included these only once and made a note within the sheet of any other figures that also show these data. Oftentimes, the data are included in both a main figure (found in the text) and one or more supplemental figures; in these cases, we labeled the tabs according to the main figure. (XLSX)

  7. Data used to produce figures and tables

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 15, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Data used to produce figures and tables [Dataset]. https://catalog.data.gov/dataset/data-used-to-produce-figures-and-tables-c6864
    Dataset updated
    May 15, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data set was used to produce the tables and figures in the papers. This dataset is associated with the following publications:

    Lytle, D., S. Pfaller, C. Muhlen, I. Struewing, S. Triantafyllidou, C. White, S. Hayes, D. King, and J. Lu. A Comprehensive Evaluation of Monochloramine Disinfection on Water Quality, Legionella and Other Important Microorganisms in a Hospital. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 189: 116656, (2021).

    Lytle, D., C. Formal, K. Cahalan, C. Muhlen, and S. Triantafyllidou. The Impact of Sampling Approach and Daily Water Usage on Lead Levels Measured at the Tap. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 197: 117071, (2021).

  8. LLM Prompt Recovery Data: Gemini and Gemma

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    Newton Baba (2024). LLM Prompt Recovery Data: Gemini and Gemma [Dataset]. https://www.kaggle.com/datasets/newtonbaba12345/llm-prompt-recovery-data-gemini-and-gemma
    Available download formats: zip (2048938 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    Newton Baba
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    For data generated by Gemini:

    original_text - The prompt the essay was written in response to.

    prompt - The prompt provided to Gemini to rewrite the text.

    rewritten_text - The output from Gemini.

    For data generated by Gemma:

    original_text - The prompt the essay was written in response to.

    rewritten_prompt - The prompt provided to Gemma to rewrite the text.

    rewritten_text - The output from Gemma.

  9. Data used to produce figures and tables

    • catalog.data.gov
    Updated Apr 12, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Data used to produce figures and tables [Dataset]. https://catalog.data.gov/dataset/data-used-to-produce-figures-and-tables-6bca2
    Dataset updated
    Apr 12, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The dataset contains the data used to produce the figure in the manuscript. This dataset is associated with the following publication: Tang, M., D. Lytle, and J. Botkins. Accumulation and Release of Arsenic from Cast Iron: Impact of Initial Arsenic and Orthophosphate Concentrations. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 194: 116942, (2021).

  10. Data Engg data

    • kaggle.com
    Updated Jun 26, 2021
    Cite
    Apurba Sarkar (2021). Data Engg data [Dataset]. https://www.kaggle.com/apurbasarkar/data-engg-data/code
    Dataset updated
    Jun 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Apurba Sarkar
    Description

    Now based on the above two tables (UserTable and VisitorLogData), you need to create an input feature set for the Marketing Model.

    Input Feature table:

    • UserID: Unique ID of the registered user.
    • No_of_days_Visited_7_Days: How many days a user was active on the platform in the last 7 days.
    • No_Of_Products_Viewed_15_Days: Number of products viewed by the user in the last 15 days.
    • User_Vintage: Vintage (in days) of the user as of today.
    • Most_Viewed_product_15_Days: Most frequently viewed (page loads) product by the user in the last 15 days. If multiple products have a similar number of page loads, consider the most recent one. If a user has not viewed any product in the last 15 days, put it as Product101.
    • Most_Active_OS: Most frequently used OS by the user.
    • Recently_Viewed_Product: Most recently viewed (page loads) product by the user. If a user has not viewed any product, put it as Product101.
    • Pageloads_last_7_days: Count of page loads in the last 7 days by the user.
    • Clicks_last_7_days: Count of clicks in the last 7 days by the user.

    Process to create Input Feature:

    In the current case, you are supposed to generate an input feature set as of 28-May-2018. So, the visitor table covers 07-May-2018 to 27-May-2018.

    As a Data Engineer, creating an ETL pipeline would definitely be appreciated and would give you an added advantage in interviews. Your effort should be to build the ETL pipeline so that, given the user data and log data, it generates the input feature table automatically.
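
    A minimal sketch of how such a feature table could be derived is shown below. The schemas of UserTable and VisitorLogData are not shown here, so the input column names (`UserID`, `SignupDate`, `VisitDateTime`, `ProductID`, `Activity`, `OS`) and the activity labels `"pageload"` and `"click"` are assumptions, as is the reading of "last N days" as the N calendar days before the feature date and of "products viewed" as distinct products.

```python
from collections import Counter
from datetime import date, datetime, timedelta

AS_OF = date(2018, 5, 28)        # feature date stated in the description
DEFAULT_PRODUCT = "Product101"   # fallback stated in the description


def in_last_days(ts: datetime, days: int, as_of: date = AS_OF) -> bool:
    """True if ts falls within the `days` calendar days before as_of."""
    return as_of - timedelta(days=days) <= ts.date() < as_of


def most_viewed(pageloads: list) -> str:
    """Most frequently viewed product; ties go to the most recently viewed."""
    if not pageloads:
        return DEFAULT_PRODUCT
    counts = Counter(r["ProductID"] for r in pageloads)
    latest = {}
    for r in pageloads:
        pid, ts = r["ProductID"], r["VisitDateTime"]
        latest[pid] = max(latest.get(pid, ts), ts)
    return max(counts, key=lambda pid: (counts[pid], latest[pid]))


def build_features(user: dict, logs: list) -> dict:
    """Build one row of the input feature table for a single user."""
    rows = [r for r in logs if r["UserID"] == user["UserID"]]
    last7 = [r for r in rows if in_last_days(r["VisitDateTime"], 7)]
    last15 = [r for r in rows if in_last_days(r["VisitDateTime"], 15)]
    loads = [r for r in rows if r["Activity"] == "pageload"]
    loads15 = [r for r in last15 if r["Activity"] == "pageload"]
    os_counts = Counter(r["OS"] for r in rows)
    return {
        "UserID": user["UserID"],
        "No_of_days_Visited_7_Days": len({r["VisitDateTime"].date() for r in last7}),
        "No_Of_Products_Viewed_15_Days": len({r["ProductID"] for r in loads15}),
        "User_Vintage": (AS_OF - user["SignupDate"]).days,
        "Most_Viewed_product_15_Days": most_viewed(loads15),
        "Most_Active_OS": os_counts.most_common(1)[0][0] if rows else None,
        "Recently_Viewed_Product": (
            max(loads, key=lambda r: r["VisitDateTime"])["ProductID"]
            if loads else DEFAULT_PRODUCT
        ),
        "Pageloads_last_7_days": sum(r["Activity"] == "pageload" for r in last7),
        "Clicks_last_7_days": sum(r["Activity"] == "click" for r in last7),
    }
```

    Calling `build_features(user, logs)` once per registered user and collecting the results gives the full table; in a real pipeline the same logic would more likely be expressed as grouped aggregations in SQL or a dataframe library.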

  11. Big data and business analytics revenue worldwide 2015-2022

    • statista.com
    Updated Aug 17, 2021
    Cite
    Statista (2021). Big data and business analytics revenue worldwide 2015-2022 [Dataset]. https://www.statista.com/statistics/551501/worldwide-big-data-business-analytics-revenue/
    Dataset updated
    Aug 17, 2021
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data and business analytics (BDA) market was valued at ***** billion U.S. dollars in 2018 and is forecast to grow to ***** billion U.S. dollars by 2021. In 2021, more than half of BDA spending will go towards services. IT services is projected to make up around ** billion U.S. dollars, and business services will account for the remainder.

    Big data

    High volume, high velocity and high variety: one or more of these characteristics is used to define big data, the kind of data sets that are too large or too complex for traditional data processing applications. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets. For example, connected IoT devices are projected to generate **** ZBs of data in 2025.

    Business analytics

    Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate business insights. The size of the business intelligence and analytics software application market is forecast to reach around **** billion U.S. dollars in 2022. Growth in this market is driven by a focus on digital transformation, a demand for data visualization dashboards, and an increased adoption of cloud.

  12. Data from: On-farm wildflower plantings generate opposing reproductive...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Jul 11, 2025
    Cite
    Agricultural Research Service (2025). Data from: On-farm wildflower plantings generate opposing reproductive outcomes for solitary and bumble bee species [Dataset]. https://catalog.data.gov/dataset/data-from-on-farm-wildflower-plantings-generate-opposing-reproductive-outcomes-for-solitar
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Pollinator habitat can be planted on farms to enhance floral and nesting resources, and subsequently, pollinator populations. There is ample evidence linking such plantings to greater pollinator abundance on farms, but less is known about their effects on pollinator reproduction. We placed Bombus impatiens Cresson (Hymenoptera: Apidae) and Megachile rotundata (F.) (Hymenoptera: Megachilidae) nests out on 19 Mid-Atlantic farms in 2018, where half (n=10) the farms had established wildflower plantings and half (n=9) did not. Bombus impatiens nests were placed at each farm in spring and mid-summer and repeatedly weighed to capture colony growth. We quantified the relative production of reproductive castes and assessed parasitism rates by screening for conopid fly parasitism and Nosema spores within female workers. We also released M. rotundata cocoons at each farm in spring and collected new nests and emergent adult offspring over the next year, recording female weight as an indicator of reproductive potential and quantifying Nosema parasitism and parasitoid infection rates. Bombus impatiens nests gained less weight and contained female workers with Nosema spore loads over 150x greater on farms with wildflower plantings. In contrast, M. rotundata female offspring weighed more on farms with wildflower plantings and marginally less on farms with honey bee hives. We conclude that wildflower plantings likely enhance reproduction in some species, but that they could also enhance microsporidian parasitism rates in susceptible bee species. It will be important to determine how wildflower planting benefits can be harnessed while minimizing parasitism in wild and managed bee species.

  13. Excel spreadsheet of raw data used to generate Fig 2F.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 2, 2025
    Cite
    Pattamatta, Ushasree; White, Andrew; Cunningham, Anthony L.; Rana, Hafsa; Arshad, Sana; Carnt, Nicole A.; Truong, Naomi R.; Chinnery, Holly R.; Bertram, Kirstie M. (2025). Excel spreadsheet of raw data used to generate Fig 2F. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002085143
    Dataset updated
    May 2, 2025
    Authors
    Pattamatta, Ushasree; White, Andrew; Cunningham, Anthony L.; Rana, Hafsa; Arshad, Sana; Carnt, Nicole A.; Truong, Naomi R.; Chinnery, Holly R.; Bertram, Kirstie M.
    Description

    Excel spreadsheet of raw data used to generate Fig 2F.

  14. User Subscription Dummy Data

    • kaggle.com
    Updated Sep 7, 2022
    Cite
    Nitin Choudhary (2022). User Subscription Dummy Data [Dataset]. https://www.kaggle.com/datasets/nitinchoudhary012/user-subscription-dummy-data
    Dataset updated
    Sep 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nitin Choudhary
    Description

    This data is purely random and was created for learning purposes.

    In situations where data is not readily available but needed, you'll have to resort to building up the data yourself. There are many methods you can use to acquire this data from web scraping to APIs. But sometimes, you'll end up needing to create fake or “dummy” data. Dummy data can be useful in times where you know the exact features you’ll be using and the data types included but, you just don’t have the data itself.

    Features Description

    • ID — a unique string of characters to identify each user.
    • Gender — string data type of three choices.
    • Subscriber — a binary True/False choice of their subscription status.
    • Name — string data type of the first and last name of the user.
    • Email — string data type of the email address of the user.
    • Last Login — string data type of the last login time.
    • Date of Birth — string format of year-month-day.
    • Education — current education level as a string data type.
    • Bio — short string descriptions of random words.
    • Rating — integer type of a 1 through 5 rating of something.
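
    Records with these fields could be generated along the following lines. This is a sketch using only the Python standard library, not the code behind this dataset: the value lists (`GENDERS`, `EDUCATION`, `FIRST_NAMES`, `LAST_NAMES`, `WORDS`), the date ranges, and the function names are all placeholders chosen for illustration.

```python
import random
import string
from datetime import date, datetime, timedelta

# Placeholder value pools; the real dataset's choices are not listed here.
GENDERS = ["Male", "Female", "Non-binary"]
EDUCATION = ["High School", "Bachelor", "Master", "Doctorate"]
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
LAST_NAMES = ["Smith", "Lee", "Patel", "Garcia"]
WORDS = ["alpha", "river", "stone", "quiet", "bright", "cloud", "ember"]


def make_user(rng: random.Random) -> dict:
    """Build one dummy user record with the ten listed fields."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    birth = date(1960, 1, 1) + timedelta(days=rng.randrange(16000))
    login = datetime(2022, 1, 1) + timedelta(minutes=rng.randrange(60 * 24 * 240))
    return {
        "ID": "".join(rng.choices(string.ascii_lowercase + string.digits, k=12)),
        "Gender": rng.choice(GENDERS),
        "Subscriber": rng.random() < 0.5,
        "Name": f"{first} {last}",
        "Email": f"{first}.{last}@example.com".lower(),
        "Last Login": login.isoformat(sep=" "),
        "Date of Birth": birth.isoformat(),  # year-month-day
        "Education": rng.choice(EDUCATION),
        "Bio": " ".join(rng.choices(WORDS, k=5)),
        "Rating": rng.randint(1, 5),
    }


def make_users(n: int, seed: int = 0) -> list:
    """Generate n records; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_user(rng) for _ in range(n)]
```

    For example, `make_users(1000)` returns a list of 1000 dictionaries that can be written out with `csv.DictWriter`.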

    Note - This data is purely random (dummy data). If you wish, you can use it for data visualization and model building.

    Reference - https://towardsdatascience.com/build-a-your-own-custom-dataset-using-python-9296540a0178

  15. Randomised Synthetic Online Game Purchases Data

    • kaggle.com
    zip
    Updated Apr 24, 2022
    Cite
    zaclovell (2022). Randomised Synthetic Online Game Purchases Data [Dataset]. https://www.kaggle.com/datasets/zaclovell/randomised-synthetic-online-game-purchases-data
    Explore at:
    zip(1208739 bytes)Available download formats
    Dataset updated
    Apr 24, 2022
    Authors
    zaclovell
    Description

    1. Why build a dataset?

    I wanted to run data analysis and machine learning on a large dataset to build my data science skills but I felt out of touch with the various datasets available so I thought... how about I try and build my own dataset?

    2. Why gaming data?

    I wondered what data should be in the dataset and settled with online digital game purchases since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store, this is what I was aiming to replicate.

    3. Scope of the dataset

    I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of the dataset varies depending on which column of data you are viewing, for example:

    • Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, any time across the 24hr clock
    • Geographically: purchases were set up to come from any postcode in the UK; in total this is over 1,000,000 active postcodes
    • Purchases: the list of game titles available for purchase is 24
    • Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159

    4. Over 42,000 rows isn't enough?

    To generate the dataset, I built a function in Python. This function, when called with the number of rows you want in your dataset, will generate the dataset. For example, calling function(1000) will provide you with a dataset with 1000 rows.

    Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.

    Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.
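    The actual generator lives in the author's GitHub repository and is not reproduced here. As a rough illustration of the idea only, a row-count-driven generator with adjustable start/end dates might look like the sketch below. The function name `generate_purchases`, the column names, and the placeholder game titles are all assumptions made for this example, not the author's code.

```python
import random
from datetime import datetime, timedelta

# Hypothetical placeholders: the real dataset uses 24 actual game titles.
GAME_TITLES = [f"Game {i}" for i in range(1, 25)]
STORES = ["PlayStation Store", "Xbox Microsoft Store"]

# Assumed default timespan; the real function's dates are not stated here.
DEFAULT_START = datetime(2021, 1, 1)
DEFAULT_END = datetime(2022, 1, 1)


def generate_purchases(n_rows, start=DEFAULT_START, end=DEFAULT_END, seed=None):
    """Return n_rows synthetic purchase records as a list of dicts."""
    rng = random.Random(seed)  # seeded generator for reproducible output
    span_seconds = int((end - start).total_seconds())
    rows = []
    for _ in range(n_rows):
        # Uniformly random timestamp in the half-open interval [start, end).
        timestamp = start + timedelta(seconds=rng.randrange(span_seconds))
        rows.append(
            {
                "timestamp": timestamp,
                "store": rng.choice(STORES),
                "title": rng.choice(GAME_TITLES),
            }
        )
    return rows


if __name__ == "__main__":
    dataset = generate_purchases(1000, seed=42)
    print(len(dataset), "rows generated")
```

    Passing a different `start` or `end` changes the timespan covered, which mirrors the note above about editing the dates.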

    5. Disclaimer - this is still a work in progress!

    Yes, as stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved. Feel free to check out the backlog.

    One example is that in various columns the data is evenly distributed, when in fact for the dataset to be entirely random, this should not be the case. The Time column is one of the columns affected. These issues will be resolved in a later update.
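    One possible way to move a column such as Time away from an even distribution is weighted sampling. The sketch below is an illustration only and not the planned fix from the backlog: the hourly weights are invented for the example, and `sample_hours` is a hypothetical helper name.

```python
import random
from collections import Counter

# Invented relative weights for hours 0-23 (assumption: more purchases in the
# evening, few in the early morning). Real weights would need real sales data.
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 2, 3, 4, 4, 4, 5, 6,
                7, 6, 5, 5, 6, 8, 10, 10, 10, 8, 5, 2]


def sample_hours(n, seed=None):
    """Draw n purchase hours, with each hour chosen in proportion to its weight."""
    rng = random.Random(seed)
    return rng.choices(range(24), weights=HOUR_WEIGHTS, k=n)


if __name__ == "__main__":
    counts = Counter(sample_hours(10_000, seed=0))
    for hour in range(24):
        print(f"{hour:02d}:00  {counts[hour]}")
```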

    Last updated: 24/04/2022

  16. Data from: SQL Injection Attack Netflow

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +3more
    Updated Sep 28, 2022
    + more versions
    Cite
    Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
    Explore at:
    Dataset updated
    Sep 28, 2022
    Authors
    Ignacio Crespo; Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    These datasets contain SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union query SQL injection and blind SQL injection. The SQLMAP tool was used to perform the attacks.

    The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

    Datasets

    The first dataset (D1) was collected to train the detection models. The second (D2) was collected using different attacks than those used in training, to test the models and ensure their generalization.

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

        Dataset   Aim        Samples   Benign-malicious traffic ratio
        D1        Training   400,003   50%
        D2        Test       57,239    50%

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has the ipt_netflow sensor installed. The sensor is a Linux kernel module that uses Iptables to process the packets and convert them to NetFlow flows.

    DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1800 seconds (30 minutes).
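    To make the two export conditions concrete, the sketch below splits the packet timestamps of a single flow into exported records using the 15-second inactive and 1800-second active timeouts. It is a simplified model written for this description, not the ipt_netflow implementation: a real exporter works on timers rather than on packet arrival, and `split_into_flow_records` is a hypothetical name.

```python
INACTIVE_TIMEOUT = 15    # seconds without packets before a flow is exported
ACTIVE_TIMEOUT = 1800    # maximum lifetime of a flow before it is exported


def split_into_flow_records(timestamps,
                            inactive=INACTIVE_TIMEOUT,
                            active=ACTIVE_TIMEOUT):
    """Group packet timestamps (seconds) of one flow into exported records.

    Returns a list of (first_seen, last_seen, packet_count) tuples.
    """
    records = []
    first = last = None
    count = 0
    for t in sorted(timestamps):
        if first is not None:
            idle_expired = (t - last) >= inactive
            lifetime_expired = (t - first) >= active
            if idle_expired or lifetime_expired:
                # Export the current record and start a new one at this packet.
                records.append((first, last, count))
                first, count = None, 0
        if first is None:
            first = t
        last = t
        count += 1
    if first is not None:
        records.append((first, last, count))
    return records


if __name__ == "__main__":
    # An 18-second gap between t=2 and t=20 triggers the inactive timeout.
    print(split_into_flow_records([0, 1, 2, 20, 21]))
```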

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (packets received from the Internet are handled in a similar way).

    The malicious traffic collected (SQLI attacks) was generated using SQLMAP. SQLMAP is a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed on 16 nodes, which launched SQLMAP with the parameters listed in the following table.

        Parameters                               Description
        '--banner','--current-user',             Enumerate users, password hashes, privileges,
        '--current-db','--hostname',             roles, databases, tables and columns
        '--is-dba','--users','--passwords',
        '--privileges','--roles','--dbs',
        '--tables','--columns','--schema',
        '--count','--dump','--comments',
        '--schema'
        --level=5                                Increase the probability of a false positive identification
        --risk=3                                 Increase the probability of extracting data
        --random-agent                           Select the User-Agent randomly
        --batch                                  Never ask for user input, use the default behavior
        --answers="follow=Y"                     Predefined answers to yes

    Every node executed SQLIA on 200 victim nodes. Each victim node deployed a web form vulnerable to Union-type injection attacks, connected to either the MySQL or the SQLServer database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).

    The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

    However, for D2, blind SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes.
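    When working with the flow records, it can help to label each address by the role of its network. The sketch below uses Python's standard `ipaddress` module with the address spaces quoted above. The helper name `role_of` and the role labels are assumptions for illustration. Several of the quoted prefixes have host bits set (for example 182.168.1.1/24), so `strict=False` is used to normalize them to their network address.

```python
import ipaddress

# Address spaces as quoted in the description, keyed by dataset and role.
ADDRESS_SPACES = {
    "D1": {"generator": "182.168.1.1/24", "victim": "126.52.30.0/24"},
    "D2": {"generator": "152.148.48.1/24", "victim": "140.30.20.1/24"},
}

# strict=False accepts prefixes with host bits set and masks them off.
NETWORKS = {
    dataset: {
        role: ipaddress.ip_network(cidr, strict=False)
        for role, cidr in roles.items()
    }
    for dataset, roles in ADDRESS_SPACES.items()
}


def role_of(dataset, address):
    """Return 'generator', 'victim', or None for an address in a dataset."""
    ip = ipaddress.ip_address(address)
    for role, network in NETWORKS[dataset].items():
        if ip in network:
            return role
    return None


if __name__ == "__main__":
    print(role_of("D1", "126.52.30.17"))
    print(role_of("D2", "152.148.48.200"))
```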

    The MySQL server ran on MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.

  17. Statistical tests and underlying data used to generate the graphs.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1more
    Updated Aug 23, 2024
    Cite
    Aegerter-Wilmsen, Tinri; Hajnal, Alex; Laranjeira, Ana Cristina; Berger, Simon; Comi, Laura Filomena; deMello, Andrew; Kohlbrenner, Tea (2024). Statistical tests and underlying data used to generate the graphs. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001489508
    Explore at:
    Dataset updated
    Aug 23, 2024
    Authors
    Aegerter-Wilmsen, Tinri; Hajnal, Alex; Laranjeira, Ana Cristina; Berger, Simon; Comi, Laura Filomena; deMello, Andrew; Kohlbrenner, Tea
    Description

    Statistical tests and underlying data used to generate the graphs.

  18. Code to generate keys

    • ieee-dataport.org
    Updated Dec 12, 2018
    Cite
    Xiaohan Hao (2018). Code to generate keys [Dataset]. https://ieee-dataport.org/documents/code-generate-keys
    Explore at:
    Dataset updated
    Dec 12, 2018
    Authors
    Xiaohan Hao
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Keys are the core element in constructing the Bitcoin trust network. A key pair consists of a private key and a public key. The private key is used to generate signatures and the public key is used to generate addresses. Bitcoin keys are generated on the elliptic curve secp256k1. This data set contains the core code to generate the keys.
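    The dataset's own code is not shown on this page. As a minimal sketch of what secp256k1 key generation involves, the example below picks a random private key and derives the public key by multiplying the curve's base point by it. The curve constants are the standard secp256k1 parameters; the function names are chosen for this example, and the code is for illustration only (it is not constant-time and should not be used to protect real funds).

```python
import secrets

# Standard secp256k1 domain parameters (curve: y^2 = x^3 + 7 over F_P).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (
    0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
    0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8,
)


def is_on_curve(point):
    """Check the curve equation; None represents the point at infinity."""
    if point is None:
        return True
    x, y = point
    return (y * y - x * x * x - 7) % P == 0


def point_add(a, b):
    """Add two curve points using the affine group law."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a) is the point at infinity
    if a == b:
        slope = 3 * x1 * x1 * pow(2 * y1, P - 2, P) % P   # tangent line
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P, P - 2, P) % P  # chord line
    x3 = (slope * slope - x1 - x2) % P
    y3 = (slope * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, point=G):
    """Compute k * point with the double-and-add method."""
    result = None
    addend = point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


def generate_keypair():
    """Return (private_key, public_key) with private_key in [1, N - 1]."""
    private_key = secrets.randbelow(N - 1) + 1
    return private_key, scalar_mult(private_key)


if __name__ == "__main__":
    priv, pub = generate_keypair()
    print("private key:", hex(priv))
    print("public key x:", hex(pub[0]))
    print("public key y:", hex(pub[1]))
```

    Deriving an address from the public key requires further hashing and encoding steps that are outside the scope of this sketch.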

  19. Data from: Generation of Vessel Track Characteristics Using a Conditional...

    • tandf.figshare.com
    txt
    Updated Dec 16, 2024
    Cite
    Jessica N.A Campbell; Martha Dais Ferreira; Anthony W. Isenor (2024). Generation of Vessel Track Characteristics Using a Conditional Generative Adversarial Network (CGAN) [Dataset]. http://doi.org/10.6084/m9.figshare.25942783.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Taylor & Francis https://taylorandfrancis.com/
    Authors
    Jessica N.A Campbell; Martha Dais Ferreira; Anthony W. Isenor
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) models often require large volumes of data to learn a given task. However, training data can be difficult to acquire due to privacy laws and limited availability. A solution is to generate synthetic data that represents the real data. In the maritime environment, the ability to generate realistic vessel positional data is important for the development of ML models in ocean areas with scarce amounts of data, such as the Arctic, or for generating an abundance of anomalous or unique events needed for training detection models. This research explores the use of conditional generative adversarial networks (CGAN) to generate vessel displacement tracks over a 24-hour period in a constraint-free environment. The model is trained using Automatic Identification System (AIS) data that contains vessel tracking information. The results show that the CGAN is able to generate vessel displacement tracks for two different vessel types, cargo ships and pleasure craft, for three months of the year (May, July, and September). To evaluate the usability of the generated data and robustness of the CGAN model, three ML vessel classification models using displacement track data are developed using generated data and tested with real data.

  20. Data from: TAXODIUM Version 1.0: A Simple Way to Generate Uniform and...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1more
    Updated Nov 21, 2012
    Cite
    Madorsky, Alexander; Mavrodiev, Evgeny V. (2012). TAXODIUM Version 1.0: A Simple Way to Generate Uniform and Fractionally Weighted Three-Item Matrices from Various Kinds of Biological Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001129735
    Explore at:
    Dataset updated
    Nov 21, 2012
    Authors
    Madorsky, Alexander; Mavrodiev, Evgeny V.
    Description

    An open-access program for generating three-item statement (3TS) matrices from data such as molecular sequences does not currently exist. The recently developed LisBeth package allows for representation of hypotheses of homology among taxa or areas directly as rooted trees or as hierarchies; however, LisBeth is not a standard matrix-based platform. Here we present “TAXODIUM version 1.0” (TAXODIUM), a program designed for building 3TS-matrices from binary, additive (ordered) and non-additive (unordered) multistate characters, with both uniform and fractional weighting of the statements. TAXODIUM also facilitates, for the first time, use of Maximum Likelihood analyses with 3TS matrices, but future implementation of the 3TS analysis in a statistical framework will require more exploration.

