17 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
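
    A minimal pandas sketch of the merge described above (the join type and file locations are assumptions; only the column names come from the description):

    import pandas as pd

    # Load the two source files (assumed to sit in the working directory).
    student_df = pd.read_csv('student.csv')   # Student_id, Age, Gender, Grade, Employed
    marks_df = pd.read_csv('marks.csv')       # Student_id, Mark, City

    # Join on the shared Student_id key; an inner join keeps only students
    # present in both files.
    merged_df = pd.merge(student_df, marks_df, on='Student_id', how='inner')
    print(merged_df.head())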

  2. RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP...

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao (2023). RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP datasets. [Dataset]. https://plos.figshare.com/articles/dataset/RSR_sub_1_5_sub_of_ICP_and_CICP_algorithms_in_two_steps_on_US_MERGE_and_US_SNAP_datasets_/2296066
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP datasets.

  3. Replication Data for: Bespoke NPO Taxonomies - Step 02: Merge and Refine...

    • search.dataone.org
    Updated Nov 19, 2023
    Cite
    Santamarina, Francisco (2023). Replication Data for: Bespoke NPO Taxonomies - Step 02: Merge and Refine Data [Dataset]. http://doi.org/10.7910/DVN/EO2HIM
    Explore at:
    Dataset updated
    Nov 19, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Santamarina, Francisco
    Description

    Pre-processed mission statements and additional data from 1023-EZ approvals for 2018 and 2019. For additional information on cleaning steps, please go to the project's replication GitHub page.

  4. Joiner

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    HU, Tao (2024). Joiner [Dataset]. http://doi.org/10.7910/DVN/0BM2IQ
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    HU, Tao
    Description

    The joiner is a component often used in workflows to merge or join data from different sources or intermediate steps into a single output. In the context of Common Workflow Language (CWL), the joiner can be implemented as a step that combines multiple inputs into a cohesive dataset or output. This might involve concatenating files, merging data frames, or aggregating results from different computations.
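
    As a rough, implementation-agnostic illustration of what such a joiner step might wrap (the script name, file names, and Python tooling are assumptions, not part of the dataset):

    import sys
    import pandas as pd

    def join_tables(input_paths, output_path):
        """Concatenate several CSV inputs into a single output table."""
        frames = [pd.read_csv(path) for path in input_paths]
        combined = pd.concat(frames, ignore_index=True)
        combined.to_csv(output_path, index=False)

    if __name__ == '__main__':
        # Hypothetical usage: python joiner.py part1.csv part2.csv joined.csv
        *inputs, output = sys.argv[1:]
        join_tables(inputs, output)

    In an actual CWL workflow, a script like this would typically be wrapped as a CommandLineTool whose inputs are the files to join and whose single output is the joined table.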

  5. RSR1.5 and computation time with the same configuration for different...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao (2023). RSR1.5 and computation time with the same configuration for different feature-based algorithms on US-MERGE and US-SNAP datasets. [Dataset]. https://plos.figshare.com/articles/dataset/RSR_sub_1_5_sub_and_computation_time_with_the_same_configuration_for_different_feature_based_algorithms_on_US_MERGE_and_US_SNAP_datasets_/2295997
    Explore at:
    xls (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    RSR1.5 and computation time with the same configuration for different feature-based algorithms on US-MERGE and US-SNAP datasets.

  6. Legal-Linguistic Path Dependence and the Scalability of Cultural Industries:...

    • zenodo.org
    bin
    Updated Jun 17, 2025
    Cite
    Anon Anon; Anon Anon (2025). Legal-Linguistic Path Dependence and the Scalability of Cultural Industries: From Elizabethan Theater to Global IP Regimes [Dataset]. http://doi.org/10.5281/zenodo.15115958
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anon Anon; Anon Anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Title:
    Legal-Linguistic Path Dependence and the Scalability of Cultural Industries: From Elizabethan Theater to Global IP Regimes

    Creator:
    Anonymous

    DOI:
    10.5281/zenodo.15115958

    Version:
    v1 — Published March 31, 2025

    License:
    Creative Commons Attribution 4.0 International (CC BY 4.0)

    Description:
    This dataset accompanies the research study investigating how legal origins and language regimes co-evolve to shape the institutional scalability of cultural industries. Through a comparative historical lens focused on Elizabethan England and Habsburg Spain, the dataset supports the claim that English-based common law systems are more conducive to global IP regime formation than Spanish-based civil law systems. The dataset integrates:

    • The 2024 EF English Proficiency Index (EF EPI),

    • 2023 GDP data by country,

    • 2024 International Property Rights Index (IPRI), and

    • UNESCO statistics on film production language.

    These sources have been harmonized for cross-country comparative analysis, including the construction of a “Common Law Dummy” and a merged panel file for empirical testing of the legal-linguistic synergy hypothesis.

    Files Included:

    • EF_EPI_2024_with_Legal_Origin_Common_Law_Dummy.xlsx

    • GDP_2023.xlsx (corrected column header: "Country")

    • IPRI_Country_Tables_Manual.xlsx

    • UNESCO Language of film production - Langue de production des films.xlsx

    🛠 Steps to Run in Google Colab

    Step 1: Correct the Error in GDP_2023.xlsx

    • Open the file in Excel or LibreOffice.

    • Rename the first column from "ountry" to "Country".

    • Save and re-upload.

    Step 2: Upload Files to Google Colab

    1. Open https://colab.research.google.com/

    2. Select File > Upload notebook or create a new one.

    3. Upload all four .xlsx files via the file panel or using:

    python
    from google.colab import files
    uploaded = files.upload()

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    # Load data
    epi_df = pd.read_excel('EF_EPI_2024_with_Legal_Origin_Common_Law_Dummy.xlsx')
    gdp_df = pd.read_excel('GDP_2023.xlsx')
    ipri_df = pd.read_excel('IPRI_Country_Tables_Manual.xlsx')

    # Standardize and rename country columns
    epi_df['Country'] = epi_df['Country'].str.upper()
    gdp_df = gdp_df.rename(columns={'ountry': 'Country'}) # corrects typo in original column name
    gdp_df['Country'] = gdp_df['Country'].str.upper()
    ipri_df['Country'] = ipri_df['COUNTRY'].str.upper()

    # Subset relevant IPRI columns
    ipri_df = ipri_df[['Country', 'Intellectual Property Rights (IPR)']]

    # Merge datasets
    merged_df = epi_df.merge(gdp_df, on='Country', how='inner').merge(ipri_df, on='Country', how='inner')
    print("Merged rows:", merged_df.shape)

    # Create new variables
    merged_df['Log_GDP'] = np.log(merged_df['GDP'])
    merged_df['Interaction'] = merged_df['Common_Law'] * merged_df['English_Lingua_Franca']

    # Define dependent variable
    y = merged_df['Intellectual Property Rights (IPR)']

    # Model 1: without interaction
    X1 = merged_df[['Common_Law', 'English_Lingua_Franca', 'Log_GDP', 'EF EPI Score']]
    X1 = sm.add_constant(X1)

    # Model 2: with interaction
    X2 = merged_df[['Common_Law', 'English_Lingua_Franca', 'Interaction', 'Log_GDP', 'EF EPI Score']]
    X2 = sm.add_constant(X2)

    # Fit OLS models with robust standard errors (HC3)
    model1 = sm.OLS(y, X1).fit(cov_type='HC3')
    model2 = sm.OLS(y, X2).fit(cov_type='HC3')

    # Print results
    print(" === Model 1 Results ===")
    print(model1.summary())

    print(" === Model 2 Results ===")
    print(model2.summary())

  7. AFSC/REFM: Digitized 2005 GOA Trawl Logbooks merged with Fish Ticket and...

    • catalog.data.gov
    • fisheries.noaa.gov
    • +1more
    Updated Jun 1, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). AFSC/REFM: Digitized 2005 GOA Trawl Logbooks merged with Fish Ticket and Observer data [Dataset]. https://catalog.data.gov/dataset/afsc-refm-digitized-2005-goa-trawl-logbooks-merged-with-fish-ticket-and-observer-data1
    Explore at:
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The data include a full year of logbook forms for vessels 60-124 feet in length (the partial coverage fleet) that had participated in the trawl flatfish fishery of 2005 in the Gulf of Alaska. The digitized hauls were not restricted exclusively to trips in the Gulf of Alaska (GOA), since some vessels also participated in BSAI trawl fisheries. A total of 55 unique vessels' daily fishing logbooks (9 catcher-processors and 46 catcher vessels) were digitized into the Vessel Log System database. The daily production section for catcher-processors was not digitized, so they were excluded from the data entry procedure and we focus on the remaining catcher vessels. These logbook records are then combined with observer and fish ticket data for the same vessels to create a more complete accounting of each vessel's activity in 2005.

    In order to examine the utility, uniqueness, and congruence of the data contained in the logbooks with other sources, we collated vessel records from logbook data with Alaska Commercial Fisheries Entry Commission (CFEC) fish tickets (retrieved from the Alaska Fisheries Information Network (AKFIN)) and North Pacific Groundfish Observer Program observer records. Merging of datasets was a multiple-step process. The first merge was between the quality-controlled observer and fish ticket data. Prior to 2007, the observer program did not track trip-level information such as the date of departure from and return to port, or the landing date. Consequently, to combine the 2005 haul-level observer data with the trip-level data from the fish tickets for a given vessel, each observer haul was merged with a fish ticket record if the haul retrieval date from the observer data fell within the modified start and end dates derived from the fish ticket data (see above). Since the starting date on the fish ticket record represents the date fishing began, rather than the date a vessel left port, all observer haul records should be within the time frame of the fish ticket start and end dates. The observer hauls were therefore given the same trip number as determined by the fish tickets' trip-numbering algorithm.

    The same process was then repeated to merge each logbook haul onto the combined fish ticket and observer data. Trip targets were then assigned from the North Pacific Fishery Management Council comprehensive observer database (Council.Comprehensive_obs) for observed trips, and statistical areas denoted on the fish tickets were mapped to Fishery Management Plan (FMP) areas. After quality control, the dataset was considered complete, and is referred to as the combined dataset.
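
    A simplified pandas sketch of the date-containment merge described above (the column names and values are illustrative placeholders, not the actual observer or fish ticket field names):

    import pandas as pd

    # Hypothetical inputs: one row per observer haul, one row per fish-ticket trip.
    hauls = pd.DataFrame({
        'vessel_id': [101, 101, 102],
        'retrieval_date': pd.to_datetime(['2005-03-02', '2005-03-05', '2005-04-10']),
    })
    trips = pd.DataFrame({
        'vessel_id': [101, 102],
        'trip_no': [1, 7],
        'start_date': pd.to_datetime(['2005-03-01', '2005-04-08']),
        'end_date': pd.to_datetime(['2005-03-06', '2005-04-12']),
    })

    # Pair every haul with every trip of the same vessel, then keep the pairs whose
    # retrieval date falls inside the fish-ticket start/end window.
    candidates = hauls.merge(trips, on='vessel_id', how='left')
    inside = candidates['retrieval_date'].between(candidates['start_date'],
                                                  candidates['end_date'])
    hauls_with_trips = candidates[inside]
    print(hauls_with_trips[['vessel_id', 'retrieval_date', 'trip_no']])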

  8. STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Jun 26, 2017
    Cite
    Ministry of Health and Child Welfare (2017). STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe [Dataset]. https://datacatalog.ihsn.org/catalog/6968
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    World Health Organization (https://who.int/)
    Ministry of Health and Child Welfare
    Time period covered
    2005
    Area covered
    Zimbabwe
    Description

    Abstract

    Noncommunicable diseases are the leading cause of death. In 2008, more than 36 million people worldwide died of such diseases; ninety per cent of them lived in low-income and middle-income countries. The STEPS Noncommunicable Disease Risk Factor Survey, part of the STEPwise approach to surveillance (STEPS) Adult Risk Factor Surveillance project by the World Health Organization (WHO), is a survey methodology to help countries begin to develop their own surveillance systems to monitor and fight noncommunicable diseases. The methodology prescribes three steps: questionnaire, physical measurements, and biochemical measurements. The steps consist of core items, core variables, and optional modules. Core topics covered by most surveys are demographics, health status, and health behaviors. These provide data on socioeconomic risk factors and metabolic, nutritional, and lifestyle risk factors. Details may differ from country to country and from year to year.

    The general objective of the Zimbabwe NCD STEPS survey was to assess the risk factors of selected NCDs in the adult population of Zimbabwe using the WHO STEPwise approach to non-communicable diseases surveillance. The specific objectives were:

    - To assess the distribution of life-style factors (physical activity, tobacco and alcohol use) and anthropometric measurements (body mass index and central obesity) which may impact on diabetes and cardiovascular risk factors.
    - To identify dietary practices that are risk factors for selected NCDs.
    - To determine the prevalence and determinants of hypertension.
    - To determine the prevalence and determinants of diabetes.
    - To determine the prevalence and determinants of serum lipid profile.

    Geographic coverage

    Mashonaland Central, Midlands and Matebeleland South Provinces.

    Analysis unit

    Household Individual

    Universe

    The survey comprised individuals aged 25 years and over.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A multistage sampling strategy with 3 stages consisting of province, district and health centre was employed. The World Health Organization STEPwise Approach (STEPS) was used as the design basis for the survey. The 3 randomly selected provinces for the survey were Mashonaland Central, Midlands and Matebeleland South. In each province four districts were chosen and four health centres were surveyed per district. The survey comprised individuals aged 25 years and over. The survey was carried out on 3,081 respondents, consisting of 1,189 from Midlands, 944 from Mashonaland Central and 948 from Matebeleland South. A detailed description of the sampling process is provided in sections 3.8-3.9 of the survey report provided under the related materials tab.

    Sampling deviation

    Designing a community-based survey such as this one is fraught with difficulties in ensuring representativeness of the sample chosen. In this survey there was a preponderance of female respondents because of the pattern of employment of males and females which also influences urban rural migration.

    The response rate in Midlands was lower than the other two provinces in both STEP 2 and 3. This notable difference was due to the fact that Midlands had more respondents sampled from the urban communities. A higher proportion of urban respondents was formally employed and therefore did not complete STEP 2 and 3 due to conflict with work schedules.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    In this survey all the core and selected expanded and optional variables were collected. In addition a food frequency questionnaire and a UNICEF developed questionnaire, the Fortification Rapid Assessment Tool (FRAT) were administered to elicit relevant dietary information.

    Cleaning operations

    Data entry for Step 1 and Step 2 data was carried out as soon as data became available to the data management team. Step 3 data became available in October and data entry was carried out when data quality checks were completed in November. Report writing started in September and a preliminary report became available in December 2005.

    Training of data entry clerks: Five data entry clerks were recruited and trained for one week. The selection of data entry clerks was based on their performance during previous research carried out by the MOH&CW. The training of the data entry clerks involved the following:

    - Familiarization with the NCD, FRAT and FFQ questionnaires.
    - Familiarization with the data entry template.
    - Development of codes for open-ended questions.
    - Statistical package (EPI Info 6).
    - Development of a data entry template using EPI6.
    - Development of check files for each template.
    - Trial runs (mock runs) to check whether the template was complete and user friendly for data entry.
    - Double entry (what it involves, how to do it and why it should be done).
    - Pre-primary data cleaning (checking whether denominators tally) of the data entry template.

    Data entry for NCD, FRAT and FFQ questionnaires: The questionnaires were sequentially numbered and then divided among the five data entry clerks. Each data entry clerk had a unique identifier for quality control purposes. Hence, the data were entered into five separate files using the statistical package EPI Info version 6.0. The data entry clerks interchanged their files for double entry and validation of the data. Preliminary data cleaning was done for each of the five files. The five files were then merged to give a single file. The merged file was then transferred to STATA Version 7.0 using Stat Transfer version 5.0.

    Data cleaning: A data-cleaning workshop was held with the core research team members. The objectives of the workshop were:

    1. To check all data entry errors.
    2. To assess any inconsistencies in data filling.
    3. To assess any inconsistencies in data entry.
    4. To assess completeness of the data entered.

    Data merging: There were two datasets (the NCD questionnaire dataset and the laboratory dataset) after the data entry process. The two files were merged by joining corresponding observations from the NCD questionnaire dataset with those from the laboratory dataset into single observations using a unique identifier. The ID number was chosen as the unique identifier since it appeared in both data sets. The main aim of merging was to combine the two datasets containing information on the behaviour of individuals and the NCD laboratory parameters. When the two data sets were merged, a new merge variable was created. The merge variable took values 1, 2 and 3:

    Merge variable == 1: the observation appeared in the NCD questionnaire data set but a corresponding observation was not in the laboratory data set.
    Merge variable == 2: the observation appeared in the laboratory data set but a corresponding observation did not appear in the questionnaire data set.
    Merge variable == 3: the observation appeared in both data sets and reflects a complete merge of the two data sets.
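
    The merge variable described here behaves like Stata's _merge indicator. A rough pandas equivalent (with hypothetical column names, shown only to illustrate the 1/2/3 coding) would be:

    import pandas as pd

    # Hypothetical stand-ins for the NCD questionnaire and laboratory datasets.
    ncd = pd.DataFrame({'ID': [1, 2, 3], 'smoker': ['yes', 'no', 'no']})
    lab = pd.DataFrame({'ID': [2, 3, 4], 'glucose': [5.1, 6.3, 4.8]})

    # An outer join with indicator=True records where each observation came from.
    merged = ncd.merge(lab, on='ID', how='outer', indicator=True)

    # Recode the pandas indicator to the 1/2/3 convention used in the report:
    # 1 = questionnaire only, 2 = laboratory only, 3 = present in both data sets.
    merged['merge_var'] = merged['_merge'].map(
        {'left_only': 1, 'right_only': 2, 'both': 3})
    print(merged)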

    Data cleaning after merging: Data cleaning involved identifying the observations where the merge variable values were either 1 or 2. The merge status for each observation was also changed after effecting any corrections. The other variables used in the cleaning were province, district and health centre, since they also appeared in both data sets.

    Objectives of cleaning:

    1. To match common variables in both data sets and identify inconsistencies in other matching variables, e.g. province, district and health centre.
    2. To check for any data entry errors.

    Response rate

    A total of 3,081 respondents were included in the survey against an estimated sample size of 3,000. The response rate for Step 1 was 80%, and for Step 2 it was 70%, taking Step 1 accrual as 100%.

  9. A combined global ocean pCO2 climatology combining open ocean and coastal...

    • catalog.data.gov
    Updated Jul 1, 2025
    + more versions
    Cite
    (Point of Contact) (2025). A combined global ocean pCO2 climatology combining open ocean and coastal areas (NCEI Accession 0209633) [Dataset]. https://catalog.data.gov/dataset/a-combined-global-ocean-pco2-climatology-combining-open-ocean-and-coastal-areas-ncei-accession-
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    (Point of Contact)
    Description

    This dataset contains the partial pressure of carbon dioxide (pCO2) climatology that was created by merging two published and publicly available pCO2 datasets covering the open ocean (Landschützer et al. 2016) and the coastal ocean (Laruelle et al. 2017). Both fields were initially created using a 2-step neural network technique. In a first step, the global ocean is divided into 16 biogeochemical provinces using a self-organizing map. In a second step, the non-linear relationship between variables known to drive the surface ocean carbon system and gridded observations from the SOCAT open and coastal ocean datasets (Bakker et al. 2016) is reconstructed using a feed-forward neural network within each province separately. The final product is then produced by projecting driving variables, e.g., surface temperature, chlorophyll, mixed layer depth, and atmospheric CO2 onto oceanic pCO2 using these non-linear relationships (see Landschützer et al. 2016 and Laruelle et al. 2017 for more detail). This results in monthly open ocean pCO2 fields at 1°x1° resolution and coastal ocean pCO2 fields at 0.25°x0.25° resolution. To merge the products, we divided each 1°x1° open ocean bin into 16 equal 0.25°x0.25° bins without any interpolation. The common overlap area of the products has been merged by scaling the respective products by their mismatch compared to observations from the SOCAT datasets (see Landschützer et al. 2020).
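
    The bin-splitting step can be illustrated with a short numpy sketch: each 1° cell value is simply replicated over a 4x4 block of 0.25° cells, with no interpolation (an illustration only, not the authors' code; the array below is a random placeholder):

    import numpy as np

    # Placeholder 1-degree open-ocean pCO2 field on a 180 x 360 grid.
    pco2_1deg = np.random.rand(180, 360)

    # Replicate each 1-degree cell into a 4x4 block of 0.25-degree cells,
    # i.e. no interpolation, just duplication of the coarse value.
    pco2_quarter = np.repeat(np.repeat(pco2_1deg, 4, axis=0), 4, axis=1)
    print(pco2_quarter.shape)  # (720, 1440)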

  10. Seamless high-resolution soil moisture from the synergistic merging of the...

    • zenodo.org
    zip
    Updated Jun 6, 2024
    Cite
    Daniel Fiifi Tawia Hagan; Seokhyeon Kim; Guojie Wang; Xiaowen Ma; Robin van der Schalie; Yifan Hu; Yi Y. Liu; Alexander Barth; Haonan Liu; Waheed Ullah; Isaac K. Nooni; Asher S. Bhatti (2024). Seamless high-resolution soil moisture from the synergistic merging of the FengYun-3 satellite observations series [Dataset]. http://doi.org/10.5281/zenodo.11501751
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Fiifi Tawia Hagan; Seokhyeon Kim; Guojie Wang; Xiaowen Ma; Robin van der Schalie; Yifan Hu; Yi Y. Liu; Alexander Barth; Haonan Liu; Waheed Ullah; Isaac K. Nooni; Asher S. Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets are the result of merging three FengYun passive microwave soil moisture observations at a 15 km x 15 km spatial resolution from 2011 to 2020, with continuous extension as data become available. Here, we rely on a merging technique that minimizes the mean square error (MSE) using the signal-to-noise ratio (SNRopt) of the input parent products to first merge subdaily soil moisture products into daily averages. These daily averages are then gap-filled using a Data INterpolating Convolutional Auto-Encoder, DINCAE (FY3_Reoconstructed_*). The advantage of this method is that it comes with error variances (FY3_ErVar_*) for each pixel and time step, which are useful for several applications.
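
    A heavily simplified sketch of the kind of inverse-error-variance (signal-to-noise) weighting involved in such a merge (illustrative only; the actual SNRopt merging and DINCAE gap-filling are far more involved, and the numbers below are placeholders):

    import numpy as np

    def merge_parents(estimates, error_variances):
        """Merge parent products with inverse-error-variance weights, which
        minimizes the mean square error of the combined estimate when the
        parent errors are independent and unbiased."""
        estimates = np.asarray(estimates, dtype=float)
        error_variances = np.asarray(error_variances, dtype=float)
        weights = 1.0 / error_variances
        weights /= weights.sum(axis=0)              # normalize across parents
        merged = (weights * estimates).sum(axis=0)
        merged_variance = 1.0 / (1.0 / error_variances).sum(axis=0)
        return merged, merged_variance

    # Three hypothetical parent products for three time steps of one pixel.
    parents = [[0.21, 0.24, 0.19],
               [0.25, 0.22, 0.20],
               [0.18, 0.26, 0.23]]
    variances = [[0.002, 0.002, 0.002],
                 [0.004, 0.004, 0.004],
                 [0.003, 0.003, 0.003]]
    soil_moisture, error_variance = merge_parents(parents, variances)
    print(soil_moisture, error_variance)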

  11. HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Cite
    jungki son (2025). HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: HuggingFaceH4/ultrafeedback_binarized

    Dataset Sampling for Merge-Up SLM Training: To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    - Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively.
    - Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled.
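
    A rough sketch of the two steps described above (the regular expression, the token counts, and the bin logic are assumptions for illustration, not the authors' exact settings):

    import re

    # Assumed definition of "English only": ASCII letters, digits, whitespace
    # and common punctuation.
    english_only = re.compile(r"^[A-Za-z0-9\s.,;:!?()'\-]+$")

    def is_english(text):
        return bool(english_only.match(text))

    def token_bin(num_tokens, start=4000, width=200):
        # Assign a sample to a 200-token-wide bin starting at 4,000 tokens.
        return max(0, (num_tokens - start) // width)

    samples = [
        {"text": "A plain English answer.", "num_tokens": 4120},
        {"text": "Une réponse en français.", "num_tokens": 4310},
    ]
    kept = [s for s in samples if is_english(s["text"])]
    for s in kept:
        s["bin"] = token_bin(s["num_tokens"])
    print(kept)  # only the English-only sample survives the filter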

  12. SynC Data Sets

    • figshare.com
    txt
    Updated Apr 2, 2019
    Cite
    Zheng Li (2019). SynC Data Sets [Dataset]. http://doi.org/10.6084/m9.figshare.7938644.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Apr 2, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zheng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generating synthetic population data from multiple raw data sources is a fundamental step for many data science tasks with a wide range of applications. However, despite the presence of a number of approaches such as iterative proportional fitting (IPF) and combinatorial optimization (CO), an efficient and standard framework for handling this type of problem is absent. In this study, we propose a multi-stage framework called SynC (short for Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one. Our key contributions include: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) design a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) release an easy-to-use framework implementation for reproducibility and demonstrate its effectiveness with the Canada National Census data, and 4) present two real-world use cases where datasets of this nature can be leveraged by businesses.
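
    A minimal sketch of the Gaussian-copula idea at the core of such a framework (this is not the SynC implementation; the toy data and the use of empirical marginals are assumptions): dependencies are modeled on the normal scores, and synthetic samples are mapped back through the marginal distributions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Toy "survey" data with two dependent variables.
    age = rng.normal(40, 12, size=1000)
    income = np.exp(0.03 * age + rng.normal(10, 0.3, size=1000))
    data = np.column_stack([age, income])

    # 1. Transform each margin to normal scores (the Gaussian copula step)
    #    and estimate the correlation of those scores.
    ranks = stats.rankdata(data, axis=0) / (len(data) + 1)
    normal_scores = stats.norm.ppf(ranks)
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 2. Sample correlated normals and map them back through the empirical marginals.
    z = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=5000)
    u = stats.norm.cdf(z)
    synthetic = np.column_stack([
        np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])
    ])
    print(synthetic[:5])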

  13. allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Cite
    jungki son (2025). allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: allenai/llama-3.1-tulu-3-405b-preference-mixture

    Dataset Sampling for Merge-Up SLM Training: To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    - Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively.
    - Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled.

  14. Supplemental Table 2

    • figshare.com
    xlsx
    Updated Dec 29, 2022
    Cite
    Sean McAllister (2022). Supplemental Table 2 [Dataset]. http://doi.org/10.6084/m9.figshare.21791753.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    figshare
    Authors
    Sean McAllister
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Total number of reads (pairs and single reads post-merge) at each step in the quality control pipeline.

  15. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    + more versions
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Explore at:
    zip (52.2 KB; available download formats)
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you need a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes the 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for running the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save the results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files, stored for example in a folder called output/from_GEE, into a single CSV file, merged.csv. The notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
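
    The merge step of the notebook amounts to stacking the downloaded monthly CSV files; a minimal pandas sketch (assuming the folder layout mentioned above and ignoring any site-specific preprocessing):

    import glob
    import pandas as pd

    # Collect the monthly CSV files downloaded from Google Earth Engine.
    csv_files = sorted(glob.glob('output/from_GEE/*.csv'))

    # Stack them into a single table and write the combined file.
    merged = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
    merged.to_csv('merged.csv', index=False)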

  16. 2013 NOAA Coastal California TopoBathy Merge Project

    • fisheries.noaa.gov
    • catalog.data.gov
    • +1more
    html
    Updated Feb 1, 2014
    + more versions
    Cite
    OCM Partners (2014). 2013 NOAA Coastal California TopoBathy Merge Project [Dataset]. https://www.fisheries.noaa.gov/inport/item/49649
    Explore at:
    html (available download formats)
    Dataset updated
    Feb 1, 2014
    Dataset provided by
    OCM Partners
    Time period covered
    Oct 30, 2013
    Description

    This project merged recently collected topographic, bathymetric, and acoustic elevation data along the entire California coastline from approximately the 10 meter elevation contour out to California's 3-mile state waters boundary. Topographic LiDAR: The topographic lidar data used in this merged project was the 2009-2011 CA Coastal Conservancy Lidar Project. The data were collected between Octob...

  17. IDWE_CHM (NRT_F)

    • figshare.com
    hdf
    Updated Jul 24, 2025
    Cite
    Hao Chen (2025). IDWE_CHM (NRT_F) [Dataset]. http://doi.org/10.6084/m9.figshare.28616207.v6
    Explore at:
    hdf (available download formats)
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hao Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A near-real-time (NRT) extension of the IDWE_CHM dataset with ongoing daily updates beyond 2023. This NRT product continues to apply the IDWE framework to incoming data, thereby extending the record in near real time. Users can obtain timely precipitation estimates with the same ~0.1° resolution and methodological consistency as the historical dataset.

    For a comprehensive description of the project, please refer to: An Incremental Dynamic Weighting Ensemble Framework for Long-Term and NRT Precipitation Prediction, https://figshare.com/projects/An_Incremental_Dynamic_Weighting_Ensemble_Framework_for_Long-Term_and_NRT_Precipitation_Prediction/241619

    The IDWE_CHM dataset provides four precipitation variables, all derived from the ensemble framework but with slightly different modeling approaches:

    • ENS_Reg – A purely regression-based merged precipitation estimate. This product is generated by optimally weighting and combining the input datasets (ERA5-Land, IMERG, GSMaP, etc.) using regression, without additional classification. It serves as a baseline for the IDWE approach.

    • ENS_RegCla1, ENS_RegCla2, ENS_RegCla3 – Three variants of a hybrid regression-plus-classification approach (collectively called ENS_RegCla). These are produced by first applying the regression merging (as in ENS_Reg) and then using a classification step to adjust the estimates. The classification is enhanced with incremental learning, meaning the algorithm learns from errors over time. These three variants may correspond to different configurations or epochs of incremental learning, and they generally show improved skill in capturing precipitation occurrence and extremes compared to a regression-only merge.

    The updates of IDWE_CHM (NRT_F) are temporally coordinated with those of the five datasets integrated in the fusion process, with explicit synchronization maintained for the GPM_3IMERGDF dataset (available at: https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGDF_07/summary?keywords="IMERG final"), which exhibits relative latency compared to the other fused datasets.

