100+ datasets found

Continuous Work History Sample
catalog.data.gov
datasets.ai
+3 more
Updated Nov 22, 2025
Cite
Social Security Administration (2025). Continuous Work History Sample [Dataset]. https://catalog.data.gov/dataset/continuous-work-history-sample
Dataset updated
Nov 22, 2025
Dataset provided by
Social Security Administration (http://ssa.gov/)
Description
Provides an aggregate of data for the Office of the Actuary and the Office of Research, Evaluation and Statistics.
Mutual Information between Discrete and Continuous Data Sets
plos.figshare.com
txt
Updated May 30, 2023
Cite
Brian C. Ross (2023). Mutual Information between Discrete and Continuous Data Sets [Dataset]. http://doi.org/10.1371/journal.pone.0087357
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0087357
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Brian C. Ross
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Mutual information (MI) is a powerful method for detecting relationships between data sets. There are accurate methods for estimating MI that avoid problems with “binning” when both data sets are discrete or when both data sets are continuous. We present an accurate, non-binning MI estimator for the case of one discrete data set and one continuous data set. This case applies when measuring, for example, the relationship between base sequence and gene expression level, or the effect of a cancer drug on patient survival time. We also show how our method can be adapted to calculate the Jensen–Shannon divergence of two or more data sets.
Sample descriptive statistics continuous variables (n = 346).
datasetcatalog.nlm.nih.gov
figshare.com
Updated Jul 6, 2021
Cite
Balise, Raymond; Ragin, Camille; Moise, Rhoda K.; Kobetz, Erin (2021). Sample descriptive statistics continuous variables (n = 346). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000753816
Dataset updated
Jul 6, 2021
Authors
Balise, Raymond; Ragin, Camille; Moise, Rhoda K.; Kobetz, Erin
Description
Sample descriptive statistics continuous variables (n = 346).
The Continuous Categorical: An Over- Simplex-Valued Exponential Family -...
service.tib.eu
resodate.org
Updated Jan 3, 2025
Cite
(2025). The Continuous Categorical: An Over- Simplex-Valued Exponential Family - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/the-continuous-categorical--an-over--simplex-valued-exponential-family
Dataset updated
Jan 3, 2025
Description
Simplex-valued data appear throughout statistics and machine learning, for example in the context of transfer learning and compression of deep networks.
RUL Dataset from Continuous Casting Machine
kaggle.com
zip
Updated Nov 16, 2023
Cite
Iurii Katser (2023). RUL Dataset from Continuous Casting Machine [Dataset]. https://www.kaggle.com/datasets/yuriykatser/rul-dataset-from-continuous-casting-machine
Available download formats: zip (422843 bytes)
Dataset updated
Nov 16, 2023
Authors
Iurii Katser
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Problem background and equipment description

A continuous casting machine (hereafter ‘CCM’) is a unit that transforms liquid steel into solid billets of a given section, from which rolling is subsequently produced (for example, rebars). The mould sleeve is the most critical and quickly worn part of the CCM mould. The sleeve is a water-cooled copper pipe with a round or profile section. The molten metal crystallizes in contact with the sleeve walls, and the primary solid shell of the ingot is formed. The main production issue that comes up during the operation of sleeves is that defects appear on the surface of the copper pipe of the sleeve and distort the profile of its inner cavity. This disrupts the thermal conditions, which in turn affects the quality of the resulting ingots. There can be shape defects (for example, the diagonals of a square ingot become unequal and the so-called ‘rhomboidity’ occurs), the dimensions of the sides can come out wrong, and the ingot corners may develop cracks. These defects lead to further problems in rolling: the decreased quality of products and the number of rejects adversely affect the economic efficiency of production. To prevent this, the sleeve dimensions are measured at certain intervals along the entire length. If these dimensions deviate from the design ones, the sleeve is rejected. Another issue is a shorter useful life of the copper sleeves of the mould used in production. This issue is often associated with a change in the operating parameters of the continuous casting machine itself. Such parameters include temperature of the incoming molten metal, temperature of cooling water and others. The actual useful life of a mould sleeve is often less than that stated by the manufacturer, which again leads to additional equipment downtime and increases the possibility of accidents and extra production costs. The expected useful life in tons should be as follows:
- 17,000 tons for 180x180,
- 13,000 tons for 150x150.

Data acquisition

In the course of CCM operation, the automatic control system that runs the process of casting ingots creates a database of casting parameters. The collected parameters are averaged data for all the strands in each cast; the only thing that differs is the resistance of the sleeve for each strand. After removing the mould sleeve for inspection, the initial data on the process parameters of casting, the geometry of obtained ingots and other attributes can be uploaded from the SCADA. The data were collected from a real production facility but after that they were processed, cleared, aggregated and prepared by the authors to solve the RUL problem.

RUL column

This column is derived from the column "resistance, tonn": for each combination of sleeve, num_crystallizer and num_stream, the current resistance value is subtracted from the highest resistance (the moment of breaking).
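The derivation of the RUL column can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' processing code: the `sleeve` key, the `resistance` field name (standing in for "resistance, tonn") and all numeric values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; field names loosely mirror the description above and
# the values are invented purely for illustration.
rows = [
    {"sleeve": "A", "num_crystallizer": 1, "num_stream": 1, "resistance": 1000.0},
    {"sleeve": "A", "num_crystallizer": 1, "num_stream": 1, "resistance": 4000.0},
    {"sleeve": "A", "num_crystallizer": 1, "num_stream": 1, "resistance": 9000.0},
    {"sleeve": "B", "num_crystallizer": 2, "num_stream": 1, "resistance": 2500.0},
    {"sleeve": "B", "num_crystallizer": 2, "num_stream": 1, "resistance": 6000.0},
]


def group_key(row):
    """Identify one sleeve in one crystallizer on one stream."""
    return (row["sleeve"], row["num_crystallizer"], row["num_stream"])


def add_rul(rows):
    """Return copies of rows with RUL = max resistance in group - current resistance."""
    max_resistance = defaultdict(float)
    for row in rows:
        key = group_key(row)
        max_resistance[key] = max(max_resistance[key], row["resistance"])
    return [dict(row, RUL=max_resistance[group_key(row)] - row["resistance"]) for row in rows]


with_rul = add_rul(rows)
for row in with_rul:
    print(row["sleeve"], row["resistance"], row["RUL"])
```

By construction, the last cast of each group (the one with the highest resistance) gets an RUL of zero.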

Tasks

The main task, based on the dataset, was to develop a model for determining the remaining useful life in tons, or remaining casts, of the crystallizer sleeve (the ‘RUL problem’). It is recommended to solve the problem for each cast from the first to the last minus one. However, in addition to solving the RUL problem, it is always important for production to tackle the task of determining the main factors that influence the reduction and extension of the remaining useful life. This is relevant because many sleeves fail to operate up to the expected useful life, indicated above. To this end, you may want to set yourself to solve the following tasks:
- identify the factors that affect the remaining useful life,
- develop recommendations on how to increase it,
- compare the performance of sleeves that have and have not had the target resistance and determine the parameters that brought this about.
Data from: Water-Quality Data for Discrete Samples and Continuous Monitoring...
catalog.data.gov
data.usgs.gov
+1 more
Updated Oct 30, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Water-Quality Data for Discrete Samples and Continuous Monitoring on the Merrimack River, Massachusetts, June to September 2020 [Dataset]. https://catalog.data.gov/dataset/water-quality-data-for-discrete-samples-and-continuous-monitoring-on-the-merrimack-river-m
Dataset updated
Oct 30, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Merrimack River, Massachusetts
Description
This data release includes water-quality data collected at up to thirteen locations along the Merrimack River and Merrimack River Estuary in Massachusetts. In this study, conducted by the U.S. Geological Survey (USGS) in cooperation with the Massachusetts Department of Environmental Protection, discrete samples were collected, and continuous monitoring was completed from June to September 2020. The data include results of measured field properties (water temperature, specific conductivity, pH, dissolved oxygen) and laboratory concentrations of nitrogen and phosphorus species, total carbon, pheophytin-a, and chlorophyll-a. These data were collected to assess selected (mainly nutrients) water-quality conditions in the Merrimack River and Merrimack River Estuary at the thirteen locations and identify areas where more water-quality monitoring is needed. The discrete samples and continuous-monitoring data are also available in the USGS National Water Information System at https://waterdata.usgs.gov/nwis. This data release consists of (1) Table of the discrete water-quality data collected (Merrimack_DiscreteWQ_Data.csv); (2) Statistical summaries including the minimum, median, and maximum of the discrete water-quality data collected (Merrimack_DiscreteWQ_Statistical_Data.original.csv); (3) Statistical summaries including the minimum, median, and maximum of the continuous water-quality data collected (Merrimack_ContinuousWQ_Statistical_Data.csv); (4) Table of vertical profile data (Merrimack_VerticalWQ_Profiles_Data.csv); (5) Table of continuous monitor deployment location and dates (Merrimack_ContinuousWQ_Deployment_Dates.csv); (6) Time-series plots of continuous water-quality data (Continuous_QW_Plots_All.zip); (7) Vertical profile plots (Vertical Profiles_QW_Plots.zip).
Overcoming the pitfalls of categorizing continuous variables in ecology,...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated Jul 12, 2024
Cite
Roxanne Beltran; Corey Tarwater (2024). Overcoming the pitfalls of categorizing continuous variables in ecology, evolution, and behavior [Dataset]. http://doi.org/10.5061/dryad.5x69p8d9r
Unique identifier
https://doi.org/10.5061/dryad.5x69p8d9r
Dataset updated
Jul 12, 2024
Dataset provided by
Dryad Digital Repository
Authors
Roxanne Beltran; Corey Tarwater
Time period covered
Jan 1, 2023
Description
Many variables in biological research - from body size to life history timing to environmental characteristics - are measured continuously (e.g., body mass in kilograms) but analyzed as categories (e.g., large versus small), which can lower statistical power and change interpretation. We conducted a mini-review of 72 recent publications in six popular ecology, evolution, and behavior journals to quantify the prevalence of categorization. We then summarized commonly categorized metrics and simulated a dataset to demonstrate the drawbacks of categorization using common variables and realistic examples. We show that categorizing continuous variables is common (31% of publications reviewed). We also underscore that predictor variables can and should be collected and analyzed continuously. Finally, we provide recommendations on how to keep variables continuous throughout the entire scientific process. Together, these pieces comprise an actionable guide to increasing statistical power and fac...

# Overcoming the pitfalls of categorizing continuous variables in ecology and evolutionary biology

https://doi.org/10.5061/dryad.5x69p8d9r

We simulated data to quantify the detrimental impact of categorizing continuous variables using various statistical breakpoints and sample sizes (details below). To give the example biological relevance, we created a dataset that illustrates the complexity of life history theory and climate change impacts, and contains a predictor variable that is frequently categorized (Table 2) - reproductive timing in one year and its effect on body size in the following year. A reasonable research question would be: How does timing of reproduction in year t influence body mass at the start of the breeding season in year t+1? For illustrative purposes, let’s say we collected data from individually banded penguins in Antarctica. Based on the mechanistic relationships between seasonally available sea ice and food availabi...
High Frequency Phone Survey, Continuous Data Collection 2023 - Papua New...
microdata.pacificdata.org
Updated Apr 30, 2025
+ more versions
Cite
William Seitz (2025). High Frequency Phone Survey, Continuous Data Collection 2023 - Papua New Guinea [Dataset]. https://microdata.pacificdata.org/index.php/catalog/877
Dataset updated
Apr 30, 2025
Dataset provided by
Darian Naidoo
William Seitz
Time period covered
2023 - 2025
Area covered
Papua New Guinea
Description
Abstract

Access to up-to-date socio-economic data is a widespread challenge in Papua New Guinea and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.

For PNG, after five rounds of data collection from 2020-2022, in April 2023 a monthly HFPS data collection commenced and continued for 18 months (ending September 2024) on topics including employment, income, food security, health, food prices, assets and well-being. This followed an initial pilot of the data collection from January 2023-March 2023. Data for April 2023-September 2023 were a repeated cross section, while October 2023 established the first month of a panel, which is ongoing as of March 2025. For each month, approximately 550-1000 households were interviewed. The sample is representative of urban and rural areas but is not representative at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in PNG. There is one data file for household-level data with a unique household ID, and separate files for individual-level data within each household and for household food price data, which can be matched to the household file using the household ID. The individual-level data also include a unique individual ID within the household, which can be used to track individuals over time within households.

Geographic coverage

Urban and rural areas of Papua New Guinea

Analysis unit

Household, Individual

Kind of data

Sample survey data [ssd]

Sampling procedure

The initial sample was drawn through Random Digit Dialing (RDD) with geographic stratification from a large random sample of Digicel’s subscribers. As an objective of the survey was to measure changes in household economic wellbeing over time, the HFPS sought to contact a consistent number of households across each province month to month. This was initially a repeated cross section from April 2023-Dec 2023. The resulting overall sample has a probability-based weighted design, with a proportionate stratification to achieve a proper geographical representation. More information on sampling for the cross-sectional monthly sample can be found in previous documentation for the PNG HFPS data.

A monthly panel was established in October 2023 and is ongoing as of March 2025. In each subsequent round of data collection after October 2024, the survey firm would first attempt to contact all households from the previous month, and then attempt to contact households from earlier months that had dropped out. After previous numbers were exhausted, RDD with geographic stratification was used for replacement households.

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire, which can be found in the External Resources of this documentation, is in English with a Pidgin translation.

The survey instrument for Q1 2025 consists of the following modules:
1. Basic Household information
2. Household Roster
3. Labor
4a. Food security
4b. Food prices
5. Household income
6. Agriculture
8. Access to services
9. Assets
10. Wellbeing and shocks
10a. WASH

Cleaning operations

The raw data were cleaned by the World Bank team using STATA. This included formatting and correcting errors identified through the survey’s monitoring and quality control process. The data are presented in two datasets: a household dataset and an individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household data set contains information about household demographics, education, food security, food prices, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (id_member) can be found in the individual dataset.
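Matching the individual dataset to the household dataset through the shared identifier could look roughly like the sketch below. Only the identifier names `hhid` and `id_member` come from the description above; the other fields (`urban`, `age`) and all values are hypothetical, and the real files are Stata datasets that would need to be loaded first.

```python
# Hypothetical stand-ins for the two files; only `hhid` and `id_member`
# are names taken from the dataset description.
households = [
    {"hhid": "H1", "urban": True},
    {"hhid": "H2", "urban": False},
]
individuals = [
    {"hhid": "H1", "id_member": 1, "age": 34},
    {"hhid": "H1", "id_member": 2, "age": 17},
    {"hhid": "H2", "id_member": 1, "age": 52},
]


def attach_household(individuals, households):
    """Many-to-one join: copy household fields onto each individual row."""
    by_hhid = {h["hhid"]: h for h in households}
    merged = []
    for person in individuals:
        household = by_hhid.get(person["hhid"])
        if household is None:
            continue  # individual without a matching household record
        merged.append({**household, **person})
    return merged


merged = attach_household(individuals, households)

# (hhid, id_member) together identify one person across survey rounds.
person_keys = {(row["hhid"], row["id_member"]) for row in merged}
print(len(merged), sorted(person_keys))
```

The join is many-to-one: each household row is repeated for every member, so household-level variables should not be summed over the merged rows.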
data-analysis-datasets
huggingface.co
Cite
canns, data-analysis-datasets [Dataset]. https://huggingface.co/datasets/canns-team/data-analysis-datasets
Dataset authored and provided by
canns
Description
CANNS Analysis Datasets

This repository contains example datasets for the CANNS (Continuous Attractor Neural Networks) data analysis package.

Datasets

ROI_data.txt (703 KB)

Description: 1D CANN ROI data for bump analysis
Format: Text file with neural activity measurements
Usage: 1D CANN analysis, MCMC bump fitting
Example: Used in 1D CANN analysis tutorials

grid_1.npz (8.7 MB)

Description: Grid cell spike data with position information
Format:…

See the full description on the dataset page: https://huggingface.co/datasets/canns-team/data-analysis-datasets.
Sample of sequential rules mined from anonymity datasets generated by LBS...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Aug 11, 2016
Cite
Liu, Zhao; Wu, Chenxue; Zhu, Yunhong; Zhang, Haitao; Chen, Zewei (2016). Sample of sequential rules mined from anonymity datasets generated by LBS continuous queries. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001597751
Dataset updated
Aug 11, 2016
Authors
Liu, Zhao; Wu, Chenxue; Zhu, Yunhong; Zhang, Haitao; Chen, Zewei
Description
Sample of sequential rules mined from anonymity datasets generated by LBS continuous queries.
High Frequency Phone Survey, Continuous Data Collection 2023 - Vanuatu
microdata.pacificdata.org
Updated Mar 23, 2025
+ more versions
Cite
Shohei Nakamura (2025). High Frequency Phone Survey, Continuous Data Collection 2023 - Vanuatu [Dataset]. https://microdata.pacificdata.org/index.php/catalog/878
Dataset updated
Mar 23, 2025
Dataset provided by
World Bank Group (http://www.worldbank.org/)
William Seitz
Shohei Nakamura
Time period covered
2024 - 2025
Area covered
Vanuatu
Description
Abstract

Access to up-to-date socio-economic data is a widespread challenge in Vanuatu and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.

For Vanuatu, data for December 2023 – January 2025 were collected, with each month having approximately 1000 households in the sample. The sample is representative of urban and rural areas but is not representative at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in Vanuatu. There is one data file for household-level data with a unique household ID, and a separate file for individual-level data within each household, which can be matched to the household file using the household ID. The individual file also has a unique individual ID within the household, which can be used to track individuals over time within households where the data are panel data.

Geographic coverage

National, urban and rural. Six provinces were covered by this survey: Sanma, Shefa, Torba, Penama, Malampa and Tafea.

Analysis unit

Household and individuals.

Kind of data

Sample survey data [ssd]

Sampling procedure

The Vanuatu High Frequency Phone Survey (HFPS) sample is drawn from the list of customer phone numbers (MSIDNS) provided by Digicel Vanuatu, one of the country’s two main mobile providers. Digicel’s customer base spans all regions of Vanuatu. For the initial data collection, Digicel filtered their MSIDNS database to ensure a representative distribution across regions. Recognizing the challenge of reaching low-income respondents, Digicel also included low-income areas and customers with a low-income profile (defined by monthly spending between 50 and 150 VT), as well as those with only incoming calls or using the IOU service without repayment. These filtered lists were then randomized, and enumerators began calling the numbers.

This approach was used to complete the first round of 1,000 interviews. The respondents from this first round formed a panel to be surveyed monthly. Each month, phone numbers from the panel are contacted until all have been interviewed, at which point new phone numbers (fresh MSIDNS from Digicel’s database) are used to replace those that have been exhausted. These new respondents are then added to the panel for future surveys.

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire was developed in both English and Bislama. Sections of the Questionnaire:

- Interview Information
- Household Roster (separate modules for new households and returning households)
- Labor (separate modules for new households and returning households)
- Food Security
- Household Income
- Agriculture
- Social Protection
- Access to Services
- Assets
- Perceptions
- Follow-up

Cleaning operations

At the end of data collection, the raw dataset was cleaned by the survey firm and the World Bank team. Data cleaning mainly included formatting, relabeling, and excluding survey monitoring variables (e.g., interview start and end times). Data was edited using the software STATA.

The data are presented in two datasets: a household dataset and an individual dataset. The total number of observations is 13,779 in the household dataset and 77,501 in the individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household data set contains information about household demographics, education, food security, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (hhid_mem) can be found in the individual dataset.

Response rate

In November 2024, a total of 7,874 calls were made. Of these, 2,251 calls were successfully connected, and 1,000 respondents completed the survey. By February 2024, the sample was fully comprised of returning respondents, with a re-contact rate of 99.9 percent.
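The call figures above imply a few simple ratios, worked through in the sketch below. The three counts come from the text; the ratio names (`connect_rate`, `completion_rate`, `overall_yield`) are illustrative labels chosen here, not official response-rate definitions used by the survey team.

```python
# Counts reported for the November 2024 round.
calls_made = 7874
calls_connected = 2251
interviews_completed = 1000

# Illustrative ratios; the names are assumptions, not survey terminology.
connect_rate = calls_connected / calls_made               # share of calls that connected
completion_rate = interviews_completed / calls_connected  # completed among connected
overall_yield = interviews_completed / calls_made         # completed per call placed

print(f"connected:  {connect_rate:.1%}")
print(f"completed:  {completion_rate:.1%}")
print(f"yield:      {overall_yield:.1%}")
```

On these figures, roughly 29 percent of calls connected, about 44 percent of connected calls ended in a completed interview, and about 13 percent of all calls placed produced an interview.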
DataCI Continuous Text Classification Example Using Yelp Dataset
data.niaid.nih.gov
Updated Aug 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Li, Yuanming; Yelp Inc. (2023). DataCI Continuous Text Classification Example Using Yelp Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8288432
Dataset updated
Aug 28, 2023
Dataset provided by
Yelp (http://yelp.com/)
Individual Researcher
Authors
Li, Yuanming; Yelp Inc.
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
We are using the Yelp Review Dataset as the streaming data source for the DataCI example. We have processed the Yelp review dataset into a daily-based dataset by its date. In this dataset, we will only use the data from 2020-09-01 to 2020-11-30 to simulate the streaming data scenario. We are downloading two versions of the training and validation datasets:

yelp_review_train@2020-10: from 2020-09-01 to 2020-10-15

yelp_review_val@2020-10: from 2020-10-16 to 2020-10-31

yelp_review_train@2020-11: from 2020-10-01 to 2020-11-15

yelp_review_val@2020-11: from 2020-11-16 to 2020-11-30
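The four date windows listed above can be expressed as a small lookup table and applied to daily records, as in the sketch below. This is not DataCI's own API: the `split_by_window` helper, the record layout (`date`, `text`) and the sample reviews are hypothetical, and the window endpoints are assumed to be inclusive.

```python
from datetime import date

# Window boundaries copied from the description; endpoints treated as inclusive.
WINDOWS = {
    "yelp_review_train@2020-10": (date(2020, 9, 1), date(2020, 10, 15)),
    "yelp_review_val@2020-10": (date(2020, 10, 16), date(2020, 10, 31)),
    "yelp_review_train@2020-11": (date(2020, 10, 1), date(2020, 11, 15)),
    "yelp_review_val@2020-11": (date(2020, 11, 16), date(2020, 11, 30)),
}


def split_by_window(reviews, windows=WINDOWS):
    """Assign each review to every named window whose date range contains it."""
    splits = {name: [] for name in windows}
    for review in reviews:
        day = date.fromisoformat(review["date"])
        for name, (start, end) in windows.items():
            if start <= day <= end:
                splits[name].append(review)
    return splits


# Hypothetical records standing in for the daily Yelp review data.
reviews = [
    {"date": "2020-09-05", "text": "..."},
    {"date": "2020-10-10", "text": "..."},
    {"date": "2020-10-20", "text": "..."},
    {"date": "2020-11-20", "text": "..."},
]
splits = split_by_window(reviews)
print({name: len(items) for name, items in splits.items()})
```

Note that the windows overlap across versions: the November training window (from 2020-10-01) covers the whole October validation window, so a review can legitimately appear in more than one named split.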
High Frequency Phone Survey, Continuous Data Collection 2023 - Tonga
microdata.pacificdata.org
Updated Apr 15, 2025
+ more versions
Cite
William Seitz (2025). High Frequency Phone Survey, Continuous Data Collection 2023 - Tonga [Dataset]. https://microdata.pacificdata.org/index.php/catalog/879
Dataset updated
Apr 15, 2025
Dataset provided by
William Seitz
Shohei Nakamura
Time period covered
2023 - 2024
Area covered
Tonga
Description
Abstract

Access to up-to-date socio-economic data is a widespread challenge in Tonga and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details. For Tonga, after two rounds of data collection in 2022, monthly HFPS data collection commenced in April 2023 and continued until November 2024 (but with some gaps in the months of collection). The survey collected socio-economic data on topics including employment, income, food security, health, food prices, assets and well-being. Each month of collection has approximately 415 households in the sample and is representative of urban and rural areas. This dataset contains combined monthly survey data for all months of the continuous HFPS in Tonga.

Geographic coverage

National urban and rural areas (5 islands): Tongatapu, Vava'u, Ha'apai, Eua, Ongo Niua

Analysis unit

Individual and household.

Kind of data

Sample survey data [ssd]

Sampling procedure

The Tonga High Frequency Phone Survey (HFPS) monthly sample was generated in three ways. The first method is a Random Digit Dialing (RDD) process covering all cell telephone numbers active at the time of the sample selection. The RDD methodology generates virtually all possible telephone numbers in the country under the national telephone numbering plan and then draws a random sample of numbers. This method guarantees full coverage of the population with a phone.

First, a large first-phase sample of cell phone numbers was selected and screened through an automated process to identify the active numbers. Then, a smaller second-phase sample was selected from the active residential numbers identified in the first-phase sample and was delivered to the data collection team to be called by the interviewers. When a cell phone was called, the call answerer was interviewed as long as he or she was 18 years of age or above and knowledgeable about the household activities.

It was initially planned to stratify the sample by island group based on the phone number prefixes. However, this was not feasible given the high internal migration across islands and the atypical assignment of phone number prefixes across islands in Tonga. The raw sample overrepresents urban areas and the population of Tongatapu.
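The two-phase RDD procedure described above (draw a large first-phase sample from the numbering plan, screen it for active numbers, then draw a smaller second-phase sample for interviewers) can be sketched as follows. Everything specific here is hypothetical: the prefixes, the number length, the sample sizes and the `looks_active` screening rule are placeholders, not Tonga's numbering plan or the survey firm's screening process.

```python
import random


def two_phase_rdd(prefixes, digits, phase1_size, phase2_size, is_active, seed=0):
    """Draw a first-phase RDD sample, screen it, then subsample active numbers."""
    if phase1_size > len(prefixes) * 10 ** digits:
        raise ValueError("phase1_size exceeds the size of the numbering plan")
    rng = random.Random(seed)

    # Phase 1: random distinct numbers generated from the numbering plan.
    phase1 = set()
    while len(phase1) < phase1_size:
        prefix = rng.choice(prefixes)
        suffix = rng.randrange(10 ** digits)
        phase1.add(f"{prefix}{suffix:0{digits}d}")

    # Automated screening keeps only numbers identified as active.
    active = sorted(number for number in phase1 if is_active(number))

    # Phase 2: smaller random sample of active numbers for the interviewers.
    phase2 = rng.sample(active, min(phase2_size, len(active)))
    return phase1, active, phase2


def looks_active(number):
    """Hypothetical screening rule; a real screen would probe the network."""
    return int(number) % 3 == 0


phase1, active, phase2 = two_phase_rdd(
    prefixes=["77", "78"],  # hypothetical mobile prefixes
    digits=5,
    phase1_size=500,
    phase2_size=50,
    is_active=looks_active,
)
print(len(phase1), len(active), len(phase2))
```

Because no stratification is applied, the second-phase sample simply mirrors the mix of active numbers, which is consistent with the note above that the raw sample overrepresents some areas.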

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire was developed in both English and Tongan and can be found in this documentation in Excel format. Sections of the Questionnaire are provided below:
1. Interview information and Basic information
2. Household roster
3. Labor
4. Food security and food prices
5. Household income
6. Agriculture
7. Social protection
8. Access to services
9. Assets
10. Education
11. Follow up

Cleaning operations

At the end of data collection, the raw dataset was cleaned by the survey firm and the World Bank team. Data cleaning mainly included formatting, relabeling, and excluding survey monitoring variables (e.g., interview start and end times). Data was edited using the software Stata.
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Available download formats: .json, .xml, .csv, .xls
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
Canada, Isle of Man, Tunisia, British Indian Ocean Territory, Taiwan, Bangladesh, Andorra, Northern Mariana Islands, Nepal, Moldova (Republic of)
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

- Owler: Gain valuable business insights and competitive intelligence.
- AngelList: Receive fresh startup data transformed into actionable insights.
- CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
- Craft.co: Make data-informed business decisions with Craft.co's company datasets.
- Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Dataset sample.
plos.figshare.com
xls
Updated Apr 18, 2024
+ more versions
Cite
Zhanhui Hu; Guangzhong Liu; Xinyu Xiang; Yanping Li; Siqing Zhuang (2024). Dataset sample. [Dataset]. http://doi.org/10.1371/journal.pone.0298809.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0298809.t002
Dataset updated
Apr 18, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Zhanhui Hu; Guangzhong Liu; Xinyu Xiang; Yanping Li; Siqing Zhuang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the rapid development of the Internet, the continuous increase of malware and its variants has brought great challenges for cyber security. Due to the imbalance of the data distribution, research on malware detection focuses on the accuracy over the whole data sample, while ignoring the detection rate of the minority categories’ malware. In the dataset sample, the normal data samples account for the majority, while the attacks’ malware accounts for the minority. However, the minority categories’ attacks can bring great losses to countries, enterprises, or individuals. To solve this problem, this study proposed the GNGS algorithm to construct a new balanced dataset, so that the model pays more attention to feature learning for the minority attacks’ malware and improves their detection rate. The traditional malware detection method is highly dependent on professional knowledge and static analysis, so we used the Self-Attention with Gate mechanism (SAG) based on the Transformer to carry out feature extraction between the local and global features and filter irrelevant noise information, then extracted the long-distance dependency temporal sequence features with the BiGRU network, and obtained the classification results through the SoftMax classifier. In the study, we used the Alibaba Cloud dataset for malware multi-classification. In comparisons of the GSB deep learning network model with other current studies, the experimental results showed that the Gaussian noise generation strategy (GNGS) could solve the unbalanced distribution of minority categories’ malware, and that the SAG-BiGRU algorithm obtained an accuracy rate of 88.7% on the eight-class task, which is better than other existing algorithms. The GSB model also performed well on the NSL-KDD dataset, which showed that the GSB model is effective for other network intrusion detection.
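The general idea of rebalancing a dataset with Gaussian noise can be illustrated with a minimal sketch. This is not the authors' GNGS algorithm, whose details are not given in this description; it only shows the common technique of creating synthetic minority-class samples by adding small Gaussian perturbations to existing ones. The function name `oversample_with_gaussian_noise`, the `sigma` parameter, and the toy feature vectors are assumptions made for illustration.

```python
import random
from collections import Counter


def oversample_with_gaussian_noise(samples, labels, sigma=0.05, seed=0):
    """Balance classes by appending noisy copies of minority-class samples.

    Hypothetical helper for illustration; not the GNGS algorithm itself.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            base = rng.choice(pool)
            # Perturb every feature with zero-mean Gaussian noise.
            noisy = [x + rng.gauss(0.0, sigma) for x in base]
            new_samples.append(noisy)
            new_labels.append(label)
    return new_samples, new_labels


# Toy data (made up): four "normal" samples and one "attack" sample.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.05, 0.3], [0.9, 0.8]]
y = ["normal", "normal", "normal", "normal", "attack"]

X_bal, y_bal = oversample_with_gaussian_noise(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

After the call, both classes have the same number of samples, and the original samples are kept unchanged at the front of the list. The noise scale `sigma` would need tuning against real feature ranges.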
Detailed analysis of data generation procedure.
plos.figshare.com
xls
Updated Nov 16, 2023
Cite
Insoo Kim; Junhee Seok; Yoojoong Kim (2023). Detailed analysis of data generation procedure. [Dataset]. http://doi.org/10.1371/journal.pone.0294513.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0294513.t001
Dataset updated
Nov 16, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Insoo Kim; Junhee Seok; Yoojoong Kim
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Traditionally, datasets with multiple censored time-to-events have not been utilized in multivariate analysis because of their high level of complexity. In this paper, we propose the Censored Time Interval Analysis (CTIVA) method to address this issue. It estimates the joint probability distribution of actual event times in the censored dataset by implementing a statistical probability density estimation technique on the dataset. Based on the acquired event time, CTIVA investigates variables correlated with the interval time of events via statistical tests. The proposed method handles both categorical and continuous variables simultaneously—thus, it is suitable for application on real-world censored time-to-event datasets, which include both categorical and continuous variables. CTIVA outperforms traditional censored time-to-event data handling methods by 5% on simulation data. The average area under the curve (AUC) of the proposed method on the simulation dataset exceeds 0.9 under various conditions. Further, CTIVA yields novel results on National Sample Cohort Demo (NSCD) and proteasome inhibitor bortezomib dataset, a real-world censored time-to-event dataset of medical history of beneficiaries provided by the National Health Insurance Sharing Service (NHISS) and National Center for Biotechnology Information (NCBI). We believe that the development of CTIVA is a milestone in the investigation of variables correlated with interval time of events in presence of censoring.
High Frequency Phone Survey, Continuous Data Collection 2023 - Solomon...
microdata.pacificdata.org
Updated Mar 19, 2025
+ more versions
Cite
Darian Naidoo and William Seitz (2025). High Frequency Phone Survey, Continuous Data Collection 2023 - Solomon Islands [Dataset]. https://microdata.pacificdata.org/index.php/catalog/875
Explore at:
Dataset updated
Mar 19, 2025
Dataset authored and provided by
Darian Naidoo and William Seitz
Time period covered
2023 - 2024
Area covered
Solomon Islands
Description
Abstract

Access to up-to-date socio-economic data is a widespread challenge in Solomon Islands and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.

For Solomon Islands, after five rounds of data collection from 2020-2020, a monthly HFPS data collection commenced in April 2023 and continued for 18 months (ending September 2024) on topics including employment, income, food security, health, food prices, assets and well-being. Fieldwork took place in two non-consecutive weeks of each month. Data for April 2023-December 2023 were a repeated cross section, while January 2024 established the first month of a panel, which was continued to September 2024. Each month has approximately 550 households in the sample and is representative of urban and rural areas, but is not representative at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in Solomon Islands. There is one data file for household-level data, with a unique household ID, and a separate file for individual-level data, which can be matched to the household file using the household ID. The individual file also has an individual ID that is unique within each household and can be used to track individuals over time within households where the data are panel data.

Geographic coverage

Urban and rural areas of Solomon Islands.

Analysis unit

Household, individual.

Kind of data

Sample survey data [ssd]

Sampling procedure

The initial sample was drawn through Random Digit Dialing (RDD) with geographic stratification. As an objective of the survey was to measure changes in household economic wellbeing over time, the HFPS sought to contact a consistent number of households across each province month to month. This was initially a repeated cross section from April 2023-Dec 2023. The initial sample was drawn from information provided by a major phone service provider in Solomon Islands, covering all the provinces in the country. It had a probability-based weighted design, with a proportionate stratification to achieve geographical representation. The geographical distribution compared to the 2019 Census is listed below for the first month of the HFPS monthly survey:

Choiseul: Census: 4.3%, HFPS: 5.2%
Western: Census: 14.4%, HFPS: 13.7%
Isabel: Census: 4.8%, HFPS: 4.7%
Central: Census: 3.6%, HFPS: 5.2%
Ren Bell: Census: 0.6%, HFPS: 1.4%
Guadalcanal: Census: 19.8%, HFPS: 21.1%
Malaita: Census: 23.1%, HFPS: 18.7%
Makira: Census: 5.6%, HFPS: 5.6%
Temotu: Census: 3.0%, HFPS: 3%
Honiara: Census: 20.7%, HFPS: 21.3%

Source: Census of Population and Housing 2019

Note: The values in the HFPS column represent the proportion of survey participants residing in each province, based on the raw HFPS data from April.

In April 2023, the geographic distribution of World Bank HFPS participants was generally similar to that of the census data at the province level, though within provinces, areas with less mobile phone connectivity are likely to be underrepresented. One indication of this is that urban areas constituted 38.2 percent of the survey sample, which is a slight overrepresentation, compared to 32.5 percent in the Census 2019.
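As a quick check on how closely the April 2023 sample tracks the census, the gaps can be computed in percentage points. The sketch below uses only the shares quoted above; the variable names are illustrative and not part of the dataset.

```python
# Province shares in percent, copied from the Census 2019 / HFPS comparison.
census = {
    "Choiseul": 4.3, "Western": 14.4, "Isabel": 4.8, "Central": 3.6,
    "Ren Bell": 0.6, "Guadalcanal": 19.8, "Malaita": 23.1, "Makira": 5.6,
    "Temotu": 3.0, "Honiara": 20.7,
}
hfps = {
    "Choiseul": 5.2, "Western": 13.7, "Isabel": 4.7, "Central": 5.2,
    "Ren Bell": 1.4, "Guadalcanal": 21.1, "Malaita": 18.7, "Makira": 5.6,
    "Temotu": 3.0, "Honiara": 21.3,
}

# HFPS share minus census share, in percentage points.
diff = {province: round(hfps[province] - census[province], 1) for province in census}

# Province with the largest absolute gap between survey and census.
largest_gap = max(diff, key=lambda province: abs(diff[province]))

# Urban share: 38.2 percent in the survey versus 32.5 percent in the census.
urban_gap = round(38.2 - 32.5, 1)

for province, gap in diff.items():
    print(f"{province}: {gap:+.1f} pp")
print("Largest gap:", largest_gap, diff[largest_gap])
print("Urban overrepresentation:", urban_gap, "pp")
```

On these figures the largest gap is in Malaita, which is underrepresented by 4.4 percentage points, while urban areas are overrepresented by 5.7 percentage points.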

A monthly panel was established in January 2024, which is ongoing as of March 2025. In each subsequent month after January 2024, the survey firm would first attempt to contact all households from the previous month and then attempt to contact households from earlier months that had dropped out. After previous numbers were exhausted, RDD with geographic stratification was used for replacement households. Across all months of the survey, a total of 9,926 interviews were completed.

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The questionnaire, which can be found in the External Resources of this documentation, is available in English, with Solomons Pijin translation. There were few changes to the questionnaire across the survey months, but some sections were only introduced in 2024, namely energy access questions and questions to inform the baseline data of the Solomon Islands Government Integrated Economic Development and Climate Resilience (IEDCR) project.

Cleaning operations

The raw data were cleaned by the World Bank team using STATA. This included formatting and correcting errors identified through the survey’s monitoring and quality control process. The data are presented in two datasets: a household dataset and an individual dataset. The total number of observations is 9,926 in the household dataset and 62,054 in the individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household dataset contains information about household demographics, education, food security, food prices, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (id_member) can be found in the individual dataset.
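Linking the two files on the household identifier can be sketched as follows. Only `hhid` and `id_member` are names taken from the documentation; the sample rows and the `urban` and `age` fields are made up for illustration, and the real files will likely also need a survey-month variable in the key, since households appear in multiple months.

```python
from collections import defaultdict

# Made-up rows standing in for the household and individual datasets.
households = [
    {"hhid": 101, "urban": True},
    {"hhid": 102, "urban": False},
]
individuals = [
    {"hhid": 101, "id_member": 1, "age": 34},
    {"hhid": 101, "id_member": 2, "age": 29},
    {"hhid": 102, "id_member": 1, "age": 51},
]

# Index households by hhid so each individual can look up its household.
household_by_id = {h["hhid"]: h for h in households}

# Attach household-level fields to every individual record.
merged = []
for person in individuals:
    household = household_by_id[person["hhid"]]
    merged.append({**household, **person})

# Count members per household as a simple sanity check on the link.
members_per_household = defaultdict(int)
for row in merged:
    members_per_household[row["hhid"]] += 1

print(merged[0])
print(dict(members_per_household))
```

Because `id_member` is unique only within a household, the pair (`hhid`, `id_member`) is what identifies a person across records.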
Data from: Data for multiple linear regression models for predicting...
catalog.data.gov
data.usgs.gov
+2more
Updated Nov 19, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Data for multiple linear regression models for predicting microcystin concentration action-level exceedances in selected lakes in Ohio [Dataset]. https://catalog.data.gov/dataset/data-for-multiple-linear-regression-models-for-predicting-microcystin-concentration-action
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Ohio
Description
Site-specific multiple linear regression models were developed for eight sites in Ohio (six in the Western Lake Erie Basin and two in northeast Ohio on inland reservoirs) to quickly predict action-level exceedances for a cyanotoxin, microcystin, in recreational and drinking waters used by the public. Real-time models include easily- or continuously-measured factors that do not require that a sample be collected. Real-time models are presented in two categories: (1) six models with continuous monitor data, and (2) three models with on-site measurements. Real-time models commonly included variables such as phycocyanin, pH, specific conductance, and streamflow or gage height. Many of the real-time factors were averages over time periods antecedent to the time the microcystin sample was collected, including water-quality data compiled from continuous monitors. Comprehensive models use a combination of discrete sample-based measurements and real-time factors. Comprehensive models were useful at some sites with lagged variables (< 2 weeks) for cyanobacterial toxin genes, dissolved nutrients, and (or) N to P ratios. Comprehensive models are presented in three categories: (1) three models with continuous monitor data and lagged comprehensive variables, (2) five models with no continuous monitor data and lagged comprehensive variables, and (3) one model with continuous monitor data and same-day comprehensive variables. Funding for this work was provided by the Ohio Water Development Authority and the U.S. Geological Survey Cooperative Water Program.
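The antecedent averages mentioned above can be illustrated with a small sketch that averages continuous-monitor readings over a window ending at the sample time. The function name `antecedent_mean`, the 24-hour window, and the hourly readings are assumptions for illustration; the description does not state the actual windows or processing used for these models.

```python
from datetime import datetime, timedelta


def antecedent_mean(readings, sample_time, window):
    """Mean of readings with timestamps in [sample_time - window, sample_time].

    readings is a list of (timestamp, value) pairs; returns None if the
    window contains no readings.
    """
    start = sample_time - window
    values = [value for timestamp, value in readings if start <= timestamp <= sample_time]
    if not values:
        return None
    return sum(values) / len(values)


# Made-up hourly readings from a continuous monitor (values are arbitrary).
t0 = datetime(2020, 7, 1, 0, 0)
readings = [(t0 + timedelta(hours=h), float(h)) for h in range(48)]

# Average over the 24 hours leading up to a hypothetical sample time.
sample_time = t0 + timedelta(hours=47)
mean_24h = antecedent_mean(readings, sample_time, timedelta(hours=24))
print(mean_24h)
```

A value computed this way could then serve as one explanatory variable in a regression model, alongside same-day or lagged measurements.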
English tense Dataset
kaggle.com
zip
Updated Oct 14, 2023
Cite
LeewanHung (2023). English tense Dataset [Dataset]. https://www.kaggle.com/datasets/leewanhung/tense-dataset/discussion
Explore at:
Available download formats: zip (29970 bytes)
Dataset updated
Oct 14, 2023
Authors
LeewanHung
Description
This dataset includes examples of different tenses in the English language to aid English learners and educators in understanding the usage of various tenses. Each example sentence is paired with the corresponding tense it represents.

Dataset Information:

Data Collection Source: The data was created by generating example sentences that demonstrate the use of different tenses in the English language.

Data Fields:

Sentence: An example sentence in English.

Tense:
- present
- past
- future
- present continuous
- past continuous
- future continuous
- present perfect
- past perfect
- future perfect
- present perfect continuous
- past perfect continuous
MSHA Coal Dust Samples
catalog.data.gov
datasets.ai
Updated Apr 8, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mine Safety and Health Administration (2025). MSHA Coal Dust Samples [Dataset]. https://catalog.data.gov/dataset/msha-coal-dust-samples
Explore at:
Dataset updated
Apr 8, 2025
Dataset provided by
Mine Safety and Health Administration (http://www.msha.gov/)
Description
All gravimetric dust samples taken by operators and inspectors. It includes information such as cassette numbers, the date the sample was taken, initial and final weights, sample type, occupation codes related to the person taking the sample, and mine information. Cassette number is the primary key for gravimetric samples. It also contains operator Continuous Personal Dust Monitor (CPDM) samples as of 2/1/2016. The unique key for CPDM samples is the CPDM file name. This dataset can be linked to the Mines dataset for further mine information.