17 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
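
    A minimal pandas sketch of the merge described above (the join type and file locations are assumptions; only the column names come from the description):

    import pandas as pd

    # Load the two source files (assumed to sit in the working directory).
    student_df = pd.read_csv('student.csv')   # Student_id, Age, Gender, Grade, Employed
    marks_df = pd.read_csv('marks.csv')       # Student_id, Mark, City

    # Join on the shared Student_id key; an inner join keeps only students
    # present in both files.
    merged_df = pd.merge(student_df, marks_df, on='Student_id', how='inner')
    print(merged_df.head())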

  2. RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP...

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao (2023). RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP datasets. [Dataset]. https://plos.figshare.com/articles/dataset/RSR_sub_1_5_sub_of_ICP_and_CICP_algorithms_in_two_steps_on_US_MERGE_and_US_SNAP_datasets_/2296066
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP datasets.

  3. Replication Data for: Bespoke NPO Taxonomies - Step 02: Merge and Refine...

    • search.dataone.org
    Updated Nov 19, 2023
    Cite
    Santamarina, Francisco (2023). Replication Data for: Bespoke NPO Taxonomies - Step 02: Merge and Refine Data [Dataset]. http://doi.org/10.7910/DVN/EO2HIM
    Explore at:
    Dataset updated
    Nov 19, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Santamarina, Francisco
    Description

    Pre-processed mission statements and additional data from 1023-EZ approvals for 2018 and 2019. For additional information on cleaning steps, please go to the project's replication GitHub page.

  4. Joiner

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    HU, Tao (2024). Joiner [Dataset]. http://doi.org/10.7910/DVN/0BM2IQ
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    HU, Tao
    Description

    The joiner is a component often used in workflows to merge or join data from different sources or intermediate steps into a single output. In the context of Common Workflow Language (CWL), the joiner can be implemented as a step that combines multiple inputs into a cohesive dataset or output. This might involve concatenating files, merging data frames, or aggregating results from different computations.
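
    As a rough, implementation-agnostic illustration of what such a joiner step might wrap (the script name, file names, and Python tooling are assumptions, not part of the dataset):

    import sys
    import pandas as pd

    def join_tables(input_paths, output_path):
        """Concatenate several CSV inputs into a single output table."""
        frames = [pd.read_csv(path) for path in input_paths]
        combined = pd.concat(frames, ignore_index=True)
        combined.to_csv(output_path, index=False)

    if __name__ == '__main__':
        # Hypothetical usage: python joiner.py part1.csv part2.csv joined.csv
        *inputs, output = sys.argv[1:]
        join_tables(inputs, output)

    In an actual CWL workflow, a script like this would typically be wrapped as a CommandLineTool whose inputs are the files to join and whose single output is the joined table.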

  5. RSR1.5 and computation time with the same configuration for different...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao (2023). RSR1.5 and computation time with the same configuration for different feature-based algorithms on US-MERGE and US-SNAP datasets. [Dataset]. https://plos.figshare.com/articles/dataset/RSR_sub_1_5_sub_and_computation_time_with_the_same_configuration_for_different_feature_based_algorithms_on_US_MERGE_and_US_SNAP_datasets_/2295997
    Explore at:
    xls (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hengkai Guo; Guijin Wang; Lingyun Huang; Yuxin Hu; Chun Yuan; Rui Li; Xihai Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    RSR1.5 and computation time with the same configuration for different feature-based algorithms on US-MERGE and US-SNAP datasets.

  6. Legal-Linguistic Path Dependence and the Scalability of Cultural Industries:...

    • zenodo.org
    bin
    Updated Jun 17, 2025
    Cite
    Anon Anon; Anon Anon (2025). Legal-Linguistic Path Dependence and the Scalability of Cultural Industries: From Elizabethan Theater to Global IP Regimes [Dataset]. http://doi.org/10.5281/zenodo.15115958
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anon Anon; Anon Anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Title:
    Legal-Linguistic Path Dependence and the Scalability of Cultural Industries: From Elizabethan Theater to Global IP Regimes

    Creator:
    Anonymous

    DOI:
    10.5281/zenodo.15115958

    Version:
    v1 — Published March 31, 2025

    License:
    Creative Commons Attribution 4.0 International (CC BY 4.0)

    Description:
    This dataset accompanies the research study investigating how legal origins and language regimes co-evolve to shape the institutional scalability of cultural industries. Through a comparative historical lens focused on Elizabethan England and Habsburg Spain, the dataset supports the claim that English-based common law systems are more conducive to global IP regime formation than Spanish-based civil law systems. The dataset integrates:

    • The 2024 EF English Proficiency Index (EF EPI),

    • 2023 GDP data by country,

    • 2024 International Property Rights Index (IPRI), and

    • UNESCO statistics on film production language.

    These sources have been harmonized for cross-country comparative analysis, including the construction of a “Common Law Dummy” and a merged panel file for empirical testing of the legal-linguistic synergy hypothesis.

    Files Included:

    • EF_EPI_2024_with_Legal_Origin_Common_Law_Dummy.xlsx

    • GDP_2023.xlsx (corrected column header: "Country")

    • IPRI_Country_Tables_Manual.xlsx

    • UNESCO Language of film production - Langue de production des films.xlsx

    🛠 Steps to Run in Google Colab

    Step 1: Correct the Error in GDP_2023.xlsx

    • Open the file in Excel or LibreOffice.

    • Rename the first column from "ountry" to "Country".

    • Save and re-upload.

    Step 2: Upload Files to Google Colab

    1. Open https://colab.research.google.com/

    2. Select File > Upload notebook or create a new one.

    3. Upload all four .xlsx files via the file panel or using:

    python
    from google.colab import files
    uploaded = files.upload()

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    # Load data
    epi_df = pd.read_excel('EF_EPI_2024_with_Legal_Origin_Common_Law_Dummy.xlsx')
    gdp_df = pd.read_excel('GDP_2023.xlsx')
    ipri_df = pd.read_excel('IPRI_Country_Tables_Manual.xlsx')

    # Standardize and rename country columns
    epi_df['Country'] = epi_df['Country'].str.upper()
    gdp_df = gdp_df.rename(columns={'ountry': 'Country'}) # corrects typo in original column name
    gdp_df['Country'] = gdp_df['Country'].str.upper()
    ipri_df['Country'] = ipri_df['COUNTRY'].str.upper()

    # Subset relevant IPRI columns
    ipri_df = ipri_df[['Country', 'Intellectual Property Rights (IPR)']]

    # Merge datasets
    merged_df = epi_df.merge(gdp_df, on='Country', how='inner').merge(ipri_df, on='Country', how='inner')
    print("Merged rows:", merged_df.shape)

    # Create new variables
    merged_df['Log_GDP'] = np.log(merged_df['GDP'])
    merged_df['Interaction'] = merged_df['Common_Law'] * merged_df['English_Lingua_Franca']

    # Define dependent variable
    y = merged_df['Intellectual Property Rights (IPR)']

    # Model 1: without interaction
    X1 = merged_df[['Common_Law', 'English_Lingua_Franca', 'Log_GDP', 'EF EPI Score']]
    X1 = sm.add_constant(X1)

    # Model 2: with interaction
    X2 = merged_df[['Common_Law', 'English_Lingua_Franca', 'Interaction', 'Log_GDP', 'EF EPI Score']]
    X2 = sm.add_constant(X2)

    # Fit OLS models with robust standard errors (HC3)
    model1 = sm.OLS(y, X1).fit(cov_type='HC3')
    model2 = sm.OLS(y, X2).fit(cov_type='HC3')

    # Print results
    print(" === Model 1 Results ===")
    print(model1.summary())

    print(" === Model 2 Results ===")
    print(model2.summary())

  7. AFSC/REFM: Digitized 2005 GOA Trawl Logbooks merged with Fish Ticket and...

    • catalog.data.gov
    • fisheries.noaa.gov
    • +1more
    Updated Jun 1, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). AFSC/REFM: Digitized 2005 GOA Trawl Logbooks merged with Fish Ticket and Observer data [Dataset]. https://catalog.data.gov/dataset/afsc-refm-digitized-2005-goa-trawl-logbooks-merged-with-fish-ticket-and-observer-data1
    Explore at:
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The data include a full year of logbook forms for vessels 60-124 feet in length (the partial coverage fleet) that had participated in the trawl flatfish fishery of 2005 in the Gulf of Alaska. The digitized hauls were not restricted exclusively to trips in the Gulf of Alaska (GOA), since some vessels also participated in BSAI trawl fisheries. A total of 55 unique vessels' daily fishing logbooks (9 catcher-processors and 46 catcher vessels) were digitized into the Vessel Log System database. The daily production section for catcher-processors was not digitized, so they were excluded from the data entry procedure and we focus on the remaining catcher vessels. These logbook records are then combined with observer and fish ticket data for the same vessels to create a more complete accounting of each vessel's activity in 2005.

    In order to examine the utility, uniqueness, and congruence of the data contained in the logbooks with other sources, we collated vessel records from logbook data with Alaska Commercial Fisheries Entry Commission (CFEC) fish tickets (retrieved from the Alaska Fisheries Information Network (AKFIN)) and North Pacific Groundfish Observer Program observer records. Merging of datasets was a multiple-step process. The first merge was between the quality-controlled observer and fish ticket data. Prior to 2007, the observer program did not track trip-level information such as the date of departure from and return to port, or the landing date. Consequently, to combine the 2005 haul-level observer data with the trip-level data from the fish tickets for a given vessel, each observer haul was merged with a fish ticket record if the haul retrieval date from the observer data fell within the modified start and end dates derived from the fish ticket data (see above). Since the starting date on the fish ticket record represents the date fishing began, rather than the date a vessel left port, all observer haul records should be within the time frame of the fish ticket start and end dates. The observer hauls were therefore given the same trip number as determined by the fish tickets' trip-numbering algorithm.

    The same process was then repeated to merge each logbook haul onto the combined fish ticket and observer data. Trip targets were then assigned from the North Pacific Fishery Management Council comprehensive observer database (Council.Comprehensive_obs) for observed trips, and statistical areas denoted on the fish tickets were mapped to Fishery Management Plan (FMP) areas. After quality control, the dataset was considered complete, and is referred to as the combined dataset.
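
    A simplified pandas sketch of the date-containment merge described above (the column names and values are illustrative placeholders, not the actual observer or fish ticket field names):

    import pandas as pd

    # Hypothetical inputs: one row per observer haul, one row per fish-ticket trip.
    hauls = pd.DataFrame({
        'vessel_id': [101, 101, 102],
        'retrieval_date': pd.to_datetime(['2005-03-02', '2005-03-05', '2005-04-10']),
    })
    trips = pd.DataFrame({
        'vessel_id': [101, 102],
        'trip_no': [1, 7],
        'start_date': pd.to_datetime(['2005-03-01', '2005-04-08']),
        'end_date': pd.to_datetime(['2005-03-06', '2005-04-12']),
    })

    # Pair every haul with every trip of the same vessel, then keep the pairs whose
    # retrieval date falls inside the fish-ticket start/end window.
    candidates = hauls.merge(trips, on='vessel_id', how='left')
    inside = candidates['retrieval_date'].between(candidates['start_date'],
                                                  candidates['end_date'])
    hauls_with_trips = candidates[inside]
    print(hauls_with_trips[['vessel_id', 'retrieval_date', 'trip_no']])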

  8. STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Jun 26, 2017
    Cite
    Ministry of Health and Child Welfare (2017). STEPwise Survey for Non Communicable Diseases Risk Factors 2005 - Zimbabwe [Dataset]. https://datacatalog.ihsn.org/catalog/6968
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    World Health Organization (https://who.int/)
    Ministry of Health and Child Welfare
    Time period covered
    2005
    Area covered
    Zimbabwe
    Description

    Abstract

    Noncommunicable diseases are the leading cause of death. In 2008, more than 36 million people worldwide died of such diseases; ninety per cent of them lived in low-income and middle-income countries. The STEPS Noncommunicable Disease Risk Factor Survey, part of the STEPwise approach to surveillance (STEPS) Adult Risk Factor Surveillance project by the World Health Organization (WHO), is a survey methodology to help countries begin to develop their own surveillance systems to monitor and fight noncommunicable diseases. The methodology prescribes three steps: questionnaire, physical measurements, and biochemical measurements. The steps consist of core items, core variables, and optional modules. Core topics covered by most surveys are demographics, health status, and health behaviors. These provide data on socioeconomic risk factors and metabolic, nutritional, and lifestyle risk factors. Details may differ from country to country and from year to year.

    The general objective of the Zimbabwe NCD STEPS survey was to assess the risk factors of selected NCDs in the adult population of Zimbabwe using the WHO STEPwise approach to non-communicable diseases surveillance. The specific objectives were:

    - To assess the distribution of life-style factors (physical activity, tobacco and alcohol use) and anthropometric measurements (body mass index and central obesity) which may impact on diabetes and cardiovascular risk factors.
    - To identify dietary practices that are risk factors for selected NCDs.
    - To determine the prevalence and determinants of hypertension.
    - To determine the prevalence and determinants of diabetes.
    - To determine the prevalence and determinants of serum lipid profile.

    Geographic coverage

    Mashonaland Central, Midlands and Matebeleland South Provinces.

    Analysis unit

    Household Individual

    Universe

    The survey comprised individuals aged 25 years and over.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A multistage sampling strategy with 3 stages consisting of province, district and health centre was employed. The World Health Organization STEPwise Approach (STEPS) was used as the design basis for the survey. The 3 randomly selected provinces for the survey were Mashonaland Central, Midlands and Matebeleland South. In each province four districts were chosen and four health centres were surveyed per district. The survey comprised individuals aged 25 years and over. The survey was carried out on 3,081 respondents, consisting of 1,189 from Midlands, 944 from Mashonaland Central and 948 from Matebeleland South. A detailed description of the sampling process is provided in sections 3.8-3.9 of the survey report provided under the related materials tab.

    Sampling deviation

    Designing a community-based survey such as this one is fraught with difficulties in ensuring representativeness of the sample chosen. In this survey there was a preponderance of female respondents because of the pattern of employment of males and females which also influences urban rural migration.

    The response rate in Midlands was lower than the other two provinces in both STEP 2 and 3. This notable difference was due to the fact that Midlands had more respondents sampled from the urban communities. A higher proportion of urban respondents was formally employed and therefore did not complete STEP 2 and 3 due to conflict with work schedules.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    In this survey all the core and selected expanded and optional variables were collected. In addition a food frequency questionnaire and a UNICEF developed questionnaire, the Fortification Rapid Assessment Tool (FRAT) were administered to elicit relevant dietary information.

    Cleaning operations

    Data entry for Step 1 and Step 2 data was carried out as soon as data became available to the data management team. Step 3 data became available in October and data entry was carried out when data quality checks were completed in November. Report writing started in September and a preliminary report became available in December 2005.

    Training of data entry clerks: Five data entry clerks were recruited and trained for one week. The selection of data entry clerks was based on their performance during previous research carried out by the MOH&CW. The training of the data entry clerks involved the following:

    - Familiarization with the NCD, FRAT and FFQ questionnaires.
    - Familiarization with the data entry template.
    - Development of codes for open-ended questions.
    - Statistical package (EPI Info 6).
    - Development of a data entry template using EPI6.
    - Development of check files for each template.
    - Trial runs (mock runs) to check whether the template was complete and user friendly for data entry.
    - Double entry (what it involves, how to do it and why it should be done).
    - Pre-primary data cleaning (checking whether denominators tally) of the data entry template.

    Data entry for NCD, FRAT and FFQ questionnaires: The questionnaires were sequentially numbered and then divided among the five data entry clerks. Each data entry clerk had a unique identifier for quality control purposes. Hence, the data were entered into five separate files using the statistical package EPI Info version 6.0. The data entry clerks interchanged their files for double entry and validation of the data. Preliminary data cleaning was done for each of the five files. The five files were then merged to give a single file. The merged file was then transferred to STATA Version 7.0 using Stat Transfer version 5.0.

    Data cleaning: A data-cleaning workshop was held with the core research team members. The objectives of the workshop were:

    1. To check all data entry errors.
    2. To assess any inconsistencies in data filling.
    3. To assess any inconsistencies in data entry.
    4. To assess completeness of the data entered.

    Data merging: There were two datasets (the NCD questionnaire dataset and the laboratory dataset) after the data entry process. The two files were merged by joining corresponding observations from the NCD questionnaire dataset with those from the laboratory dataset into single observations using a unique identifier. The ID number was chosen as the unique identifier since it appeared in both data sets. The main aim of merging was to combine the two datasets containing information on the behaviour of individuals and the NCD laboratory parameters. When the two data sets were merged, a new merge variable was created. The merge variable took values 1, 2 and 3:

    Merge variable == 1: the observation appeared in the NCD questionnaire data set but a corresponding observation was not in the laboratory data set.
    Merge variable == 2: the observation appeared in the laboratory data set but a corresponding observation did not appear in the questionnaire data set.
    Merge variable == 3: the observation appeared in both data sets and reflects a complete merge of the two data sets.
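
    The merge variable described here behaves like Stata's _merge indicator. A rough pandas equivalent (with hypothetical column names, shown only to illustrate the 1/2/3 coding) would be:

    import pandas as pd

    # Hypothetical stand-ins for the NCD questionnaire and laboratory datasets.
    ncd = pd.DataFrame({'ID': [1, 2, 3], 'smoker': ['yes', 'no', 'no']})
    lab = pd.DataFrame({'ID': [2, 3, 4], 'glucose': [5.1, 6.3, 4.8]})

    # An outer join with indicator=True records where each observation came from.
    merged = ncd.merge(lab, on='ID', how='outer', indicator=True)

    # Recode the pandas indicator to the 1/2/3 convention used in the report:
    # 1 = questionnaire only, 2 = laboratory only, 3 = present in both data sets.
    merged['merge_var'] = merged['_merge'].map(
        {'left_only': 1, 'right_only': 2, 'both': 3})
    print(merged)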

    Data cleaning after merging: Data cleaning involved identifying the observations where the merge variable values were either 1 or 2. The merge status for each observation was also changed after effecting any corrections. The other variables used in the cleaning were province, district and health centre, since they also appeared in both data sets.

    Objectives of cleaning:

    1. To match common variables in both data sets and identify inconsistencies in other matching variables, e.g. province, district and health centre.
    2. To check for any data entry errors.

    Response rate

    A total of 3,081 respondents were included in the survey against an estimated sample size of 3,000. The response rate for Step 1 was 80%, and for Step 2 it was 70%, taking Step 1 accrual as 100%.

  9. A combined global ocean pCO2 climatology combining open ocean and coastal...

    • catalog.data.gov
    Updated Jul 1, 2025
    + more versions
    Cite
    (Point of Contact) (2025). A combined global ocean pCO2 climatology combining open ocean and coastal areas (NCEI Accession 0209633) [Dataset]. https://catalog.data.gov/dataset/a-combined-global-ocean-pco2-climatology-combining-open-ocean-and-coastal-areas-ncei-accession-
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    (Point of Contact)
    Description

    This dataset contains the partial pressure of carbon dioxide (pCO2) climatology that was created by merging two published and publicly available pCO2 datasets covering the open ocean (Landschützer et al. 2016) and the coastal ocean (Laruelle et al. 2017). Both fields were initially created using a 2-step neural network technique. In a first step, the global ocean is divided into 16 biogeochemical provinces using a self-organizing map. In a second step, the non-linear relationship between variables known to drive the surface ocean carbon system and gridded observations from the SOCAT open and coastal ocean datasets (Bakker et al. 2016) is reconstructed using a feed-forward neural network within each province separately. The final product is then produced by projecting driving variables, e.g., surface temperature, chlorophyll, mixed layer depth, and atmospheric CO2 onto oceanic pCO2 using these non-linear relationships (see Landschützer et al. 2016 and Laruelle et al. 2017 for more detail). This results in monthly open ocean pCO2 fields at 1°x1° resolution and coastal ocean pCO2 fields at 0.25°x0.25° resolution. To merge the products, we divided each 1°x1° open ocean bin into 16 equal 0.25°x0.25° bins without any interpolation. The common overlap area of the products has been merged by scaling the respective products by their mismatch compared to observations from the SOCAT datasets (see Landschützer et al. 2020).
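
    The bin-splitting step can be illustrated with a short numpy sketch: each 1° cell value is simply replicated over a 4x4 block of 0.25° cells, with no interpolation (an illustration only, not the authors' code; the array below is a random placeholder):

    import numpy as np

    # Placeholder 1-degree open-ocean pCO2 field on a 180 x 360 grid.
    pco2_1deg = np.random.rand(180, 360)

    # Replicate each 1-degree cell into a 4x4 block of 0.25-degree cells,
    # i.e. no interpolation, just duplication of the coarse value.
    pco2_quarter = np.repeat(np.repeat(pco2_1deg, 4, axis=0), 4, axis=1)
    print(pco2_quarter.shape)  # (720, 1440)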

  10. Seamless high-resolution soil moisture from the synergistic merging of the...

    • zenodo.org
    zip
    Updated Jun 6, 2024
    Cite
    Daniel Fiifi Tawia Hagan; Seokhyeon Kim; Guojie Wang; Xiaowen Ma; Robin van der Schalie; Yifan Hu; Yi Y. Liu; Alexander Barth; Haonan Liu; Waheed Ullah; Isaac K. Nooni; Asher S. Bhatti (2024). Seamless high-resolution soil moisture from the synergistic merging of the FengYun-3 satellite observations series [Dataset]. http://doi.org/10.5281/zenodo.11501751
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Fiifi Tawia Hagan; Seokhyeon Kim; Guojie Wang; Xiaowen Ma; Robin van der Schalie; Yifan Hu; Yi Y. Liu; Alexander Barth; Haonan Liu; Waheed Ullah; Isaac K. Nooni; Asher S. Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets are the result of merging three FengYun passive microwave soil moisture observations at a 15 km x 15 km spatial resolution from 2011 to 2020, with continuous extension as data become available. Here, we rely on a merging technique that minimizes the mean square error (MSE) using the signal-to-noise ratio (SNRopt) of the input parent products to first merge subdaily soil moisture products into daily averages. These daily averages are then gap-filled using a Data INterpolating Convolutional Auto-Encoder, DINCAE (FY3_Reoconstructed_*). The advantage of this method is that it comes with error variances (FY3_ErVar_*) for each pixel and time step, which are useful for several applications.
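
    A heavily simplified sketch of the kind of inverse-error-variance (signal-to-noise) weighting involved in such a merge (illustrative only; the actual SNRopt merging and DINCAE gap-filling are far more involved, and the numbers below are placeholders):

    import numpy as np

    def merge_parents(estimates, error_variances):
        """Merge parent products with inverse-error-variance weights, which
        minimizes the mean square error of the combined estimate when the
        parent errors are independent and unbiased."""
        estimates = np.asarray(estimates, dtype=float)
        error_variances = np.asarray(error_variances, dtype=float)
        weights = 1.0 / error_variances
        weights /= weights.sum(axis=0)              # normalize across parents
        merged = (weights * estimates).sum(axis=0)
        merged_variance = 1.0 / (1.0 / error_variances).sum(axis=0)
        return merged, merged_variance

    # Three hypothetical parent products for three time steps of one pixel.
    parents = [[0.21, 0.24, 0.19],
               [0.25, 0.22, 0.20],
               [0.18, 0.26, 0.23]]
    variances = [[0.002, 0.002, 0.002],
                 [0.004, 0.004, 0.004],
                 [0.003, 0.003, 0.003]]
    soil_moisture, error_variance = merge_parents(parents, variances)
    print(soil_moisture, error_variance)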

  11. HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Cite
    jungki son (2025). HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: HuggingFaceH4/ultrafeedback_binarized

    Dataset Sampling for Merge-Up SLM Training: To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    - Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively.
    - Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceH4_ultrafeedback_binarized_filtered_10k_sampled.
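
    A rough sketch of the two steps described above (the regular expression, the token counts, and the bin logic are assumptions for illustration, not the authors' exact settings):

    import re

    # Assumed definition of "English only": ASCII letters, digits, whitespace
    # and common punctuation.
    english_only = re.compile(r"^[A-Za-z0-9\s.,;:!?()'\-]+$")

    def is_english(text):
        return bool(english_only.match(text))

    def token_bin(num_tokens, start=4000, width=200):
        # Assign a sample to a 200-token-wide bin starting at 4,000 tokens.
        return max(0, (num_tokens - start) // width)

    samples = [
        {"text": "A plain English answer.", "num_tokens": 4120},
        {"text": "Une réponse en français.", "num_tokens": 4310},
    ]
    kept = [s for s in samples if is_english(s["text"])]
    for s in kept:
        s["bin"] = token_bin(s["num_tokens"])
    print(kept)  # only the English-only sample survives the filter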

  12. SynC Data Sets

    • figshare.com
    txt
    Updated Apr 2, 2019
    Cite
    Zheng Li (2019). SynC Data Sets [Dataset]. http://doi.org/10.6084/m9.figshare.7938644.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Apr 2, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zheng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generating synthetic population data from multiple raw data sources is a fundamental step for many data science tasks with a wide range of applications. However, despite the presence of a number of approaches such as iterative proportional fitting (IPF) and combinatorial optimization (CO), an efficient and standard framework for handling this type of problem is absent. In this study, we propose a multi-stage framework called SynC (short for Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one. Our key contributions include: 1) propose a novel framework for generating individual level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) design a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) release an easy-to-use framework implementation for reproducibility and demonstrate its effectiveness with the Canada National Census data, and 4) present two real-world use cases where datasets of this nature can be leveraged by businesses.
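
    A minimal sketch of the Gaussian-copula idea at the core of such a framework (this is not the SynC implementation; the toy data and the use of empirical marginals are assumptions): dependencies are modeled on the normal scores, and synthetic samples are mapped back through the marginal distributions.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Toy "survey" data with two dependent variables.
    age = rng.normal(40, 12, size=1000)
    income = np.exp(0.03 * age + rng.normal(10, 0.3, size=1000))
    data = np.column_stack([age, income])

    # 1. Transform each margin to normal scores (the Gaussian copula step)
    #    and estimate the correlation of those scores.
    ranks = stats.rankdata(data, axis=0) / (len(data) + 1)
    normal_scores = stats.norm.ppf(ranks)
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 2. Sample correlated normals and map them back through the empirical marginals.
    z = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=5000)
    u = stats.norm.cdf(z)
    synthetic = np.column_stack([
        np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])
    ])
    print(synthetic[:5])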

  13. allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled

    • huggingface.co
    Updated Apr 10, 2025
    + more versions
    Cite
    jungki son (2025). allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled [Dataset]. https://huggingface.co/datasets/aeolian83/allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    jungki son
    Description

    Origin Datasets: allenai/llama-3.1-tulu-3-405b-preference-mixture

    Dataset Sampling for Merge-Up SLM Training: To prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:

    - Filtering for English Only: We used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively.
    - Proportional Sampling by Token Length: Starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled.

  14. Supplemental Table 2

    • figshare.com
    xlsx
    Updated Dec 29, 2022
    Cite
    Sean McAllister (2022). Supplemental Table 2 [Dataset]. http://doi.org/10.6084/m9.figshare.21791753.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    figshare
    Authors
    Sean McAllister
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Total number of reads (pairs and single reads post-merge) at each step in the quality control pipeline.

  15. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    + more versions
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Explore at:
    zip (52.2 KB; available download formats)
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you need a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes the 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for running the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save the results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files, stored for example in a folder called output/from_GEE, into a single CSV file, merged.csv. The notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
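
    The merge step of the notebook amounts to stacking the downloaded monthly CSV files; a minimal pandas sketch (assuming the folder layout mentioned above and ignoring any site-specific preprocessing):

    import glob
    import pandas as pd

    # Collect the monthly CSV files downloaded from Google Earth Engine.
    csv_files = sorted(glob.glob('output/from_GEE/*.csv'))

    # Stack them into a single table and write the combined file.
    merged = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
    merged.to_csv('merged.csv', index=False)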

  16. 2013 NOAA Coastal California TopoBathy Merge Project

    • fisheries.noaa.gov
    • catalog.data.gov
    • +1more
    html
    Updated Feb 1, 2014
    + more versions
    Cite
    OCM Partners (2014). 2013 NOAA Coastal California TopoBathy Merge Project [Dataset]. https://www.fisheries.noaa.gov/inport/item/49649
    Explore at:
    html (available download formats)
    Dataset updated
    Feb 1, 2014
    Dataset provided by
    OCM Partners
    Time period covered
    Oct 30, 2013
    Description

    This project merged recently collected topographic, bathymetric, and acoustic elevation data along the entire California coastline from approximately the 10 meter elevation contour out to California's 3-mile state waters boundary. Topographic LiDAR: The topographic lidar data used in this merged project was the 2009-2011 CA Coastal Conservancy Lidar Project. The data were collected between Octob...

  17. IDWE_CHM (NRT_F)

    • figshare.com
    hdf
    Updated Jul 24, 2025
    Cite
    Hao Chen (2025). IDWE_CHM (NRT_F) [Dataset]. http://doi.org/10.6084/m9.figshare.28616207.v6
    Explore at:
    hdf (available download formats)
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hao Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A near-real-time (NRT) extension of the IDWE_CHM dataset with ongoing daily updates beyond 2023. This NRT product continues to apply the IDWE framework to incoming data, thereby extending the record in near real time. Users can obtain timely precipitation estimates with the same ~0.1° resolution and methodological consistency as the historical dataset.

    For a comprehensive description of the project, please refer to: An Incremental Dynamic Weighting Ensemble Framework for Long-Term and NRT Precipitation Prediction, https://figshare.com/projects/An_Incremental_Dynamic_Weighting_Ensemble_Framework_for_Long-Term_and_NRT_Precipitation_Prediction/241619

    The IDWE_CHM dataset provides four precipitation variables, all derived from the ensemble framework but with slightly different modeling approaches:

    • ENS_Reg – A purely regression-based merged precipitation estimate. This product is generated by optimally weighting and combining the input datasets (ERA5-Land, IMERG, GSMaP, etc.) using regression, without additional classification. It serves as a baseline for the IDWE approach.

    • ENS_RegCla1, ENS_RegCla2, ENS_RegCla3 – Three variants of a hybrid regression-plus-classification approach (collectively called ENS_RegCla). These are produced by first applying the regression merging (as in ENS_Reg) and then using a classification step to adjust the estimates. The classification is enhanced with incremental learning, meaning the algorithm learns from errors over time. These three variants may correspond to different configurations or epochs of incremental learning, and they generally show improved skill in capturing precipitation occurrence and extremes compared to a regression-only merge.

    The updates of IDWE_CHM (NRT_F) are temporally coordinated with those of the five datasets integrated in the fusion process, with explicit synchronization maintained for the GPM_3IMERGDF dataset (available at: https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGDF_07/summary?keywords="IMERG final"), which exhibits relative latency compared to the other fused datasets.

