100+ datasets found
  1. Open Data Training Video: A proposed data de-identification framework for...

    • borealisdata.ca
    • dataone.org
    Updated Mar 15, 2023
    Cite
    Alishah Mawji; Holly Longstaff; Jessica Trawin; Clare Komugisha; Stefanie K. Novakowski; Matt Wiens; Samuel Akech; Abner Tagoola; Niranjan Kissoon; Mark J. Ansermino (2023). Open Data Training Video: A proposed data de-identification framework for clinical research [Dataset]. http://doi.org/10.5683/SP3/7XYZVC
    Dataset updated
    Mar 15, 2023
    Dataset provided by
    Borealis
    Authors
    Alishah Mawji; Holly Longstaff; Jessica Trawin; Clare Komugisha; Stefanie K. Novakowski; Matt Wiens; Samuel Akech; Abner Tagoola; Niranjan Kissoon; Mark J. Ansermino
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Objective(s): Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw datasets, in part due to concerns regarding research participant confidentiality and privacy. We provide an instructional video describing a standardized de-identification framework that can be adapted and refined based on specific context and risks.

    Data Description: Training video, presentation slides.

    Related Resources: The data de-identification algorithm, dataset, and data dictionary that correspond with this training video are available through the Smart Triage sub-Dataverse.

    NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."

  2. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0 Universal, https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.

    Methods

    eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3) and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports and institutional enterprise data warehouses (EDWs) such as the Research Patient Data Registry (RPDR) at MGB. The input, a CSV/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R Markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
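
    The key-value remapping step described above can be illustrated with a short sketch. eLAB itself is written in R, so this is only a Python illustration of the lookup-table idea; the lab names, codes, and column names below are hypothetical stand-ins rather than the actual eLAB lookup table.

    ```python
    # Illustrative Python sketch of a key-value lookup table that remaps raw EHR
    # lab subtypes to a registry data-dictionary (DD) code and drops undefined labs.
    # The names below are hypothetical; eLAB's real table covers ~300 lab subtypes.
    import pandas as pd

    LAB_LOOKUP = {
        "Potassium": "potassium",
        "Potassium-External": "potassium",
        "Potassium(POC)": "potassium",
        "Potassium,whole-bld": "potassium",
        "Sodium": "sodium",
    }

    def remap_labs(df: pd.DataFrame) -> pd.DataFrame:
        """Map raw lab names to DD codes; labs not defined in the DD are dropped."""
        out = df.copy()
        out["dd_code"] = out["lab_name"].map(LAB_LOOKUP)
        return out.dropna(subset=["dd_code"])

    raw = pd.DataFrame({
        "lab_name": ["Potassium(POC)", "Potassium-External", "Unknown assay"],
        "value": [4.1, 3.9, 7.2],
    })
    print(remap_labs(raw))  # the unmapped "Unknown assay" row is filtered out
    ```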

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as strings or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site CSV files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  3. Data and Results for GIS-Based Identification of Areas that have Resource...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 13, 2025
    Cite
    U.S. Geological Survey (2025). Data and Results for GIS-Based Identification of Areas that have Resource Potential for Lode Gold in Alaska [Dataset]. https://catalog.data.gov/dataset/data-and-results-for-gis-based-identification-of-areas-that-have-resource-potential-for-lo
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the analytical results and evaluated source data files of geospatial analyses for identifying areas in Alaska that may be prospective for different types of lode gold deposits, including orogenic, reduced-intrusion-related, epithermal, and gold-bearing porphyry. The spatial analysis is based on queries of statewide source datasets of aeromagnetic surveys, the Alaska Geochemical Database (AGDB3), the Alaska Resource Data File (ARDF), and the Alaska Geologic Map (SIM3340) within areas defined by 12-digit HUCs (subwatersheds) from the National Watershed Boundary dataset. The packages of files available for download are:
    1. LodeGold_Results_gdb.zip - The analytical results in geodatabase polygon feature classes, which contain the scores for each source dataset layer query, the accumulative score, and a designation for high, medium, or low potential and high, medium, or low certainty for a deposit type within the HUC. The data is described by FGDC metadata. An mxd file and cartographic feature classes are provided for display of the results in ArcMap. An included README file describes the complete contents of the zip file.
    2. LodeGold_Results_shape.zip - Copies of the results from the geodatabase are also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file.
    3. LodeGold_SourceData_gdb.zip - The source datasets in geodatabase and geotiff format. Data layers include aeromagnetic surveys, AGDB3, ARDF, lithology from SIM3340, and HUC subwatersheds. The data is described by FGDC metadata. An mxd file and cartographic feature classes are provided for display of the source data in ArcMap. Also included are the Python scripts used to perform the analyses; users may modify the scripts to design their own analyses. The included README files describe the complete contents of the zip file and explain the usage of the scripts.
    4. LodeGold_SourceData_shape.zip - Copies of the geodatabase source dataset derivatives from ARDF and lithology from SIM3340 created for this analysis are also provided in shapefile and CSV formats. The included README file describes the complete contents of the zip file.

  4. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures used to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts support the arguments that shaped the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in information missing after the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
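
    The transformation step can be sketched in Python for readers who want to reproduce the idea. The original analysis was carried out in RapidMiner, so the scikit-learn calls, the placeholder random matrix, and the one-hot-encoding caveat below are assumptions for illustration only.

    ```python
    # Hedged sketch: project the 35 paper features onto 2 principal components and
    # inspect how within-cluster variance (inertia) falls as the number of clusters
    # grows, to pick k. Real inputs are nominal features that would first need
    # one-hot encoding; here a random matrix stands in for the papers x features table.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((100, 35))                      # placeholder: papers x 35 features

    X2 = PCA(n_components=2).fit_transform(X)      # 2-D projection for visualization

    for k in range(2, 8):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
        print(k, round(inertia, 2))                # look for the "elbow" in this curve
    ```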

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement (premise and conclusion) holds, divided by the total number of records.
    Confidence = the support of the statement divided by the support of the premise, i.e., the fraction of records containing the premise that also contain the conclusion.
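
    The two measures can be computed directly from the extracted feature table. The sketch below uses a toy boolean table with hypothetical column names (not the actual SLR features) to show the arithmetic for a single rule such as "Supervised Learning -> irreproducible".

    ```python
    # Toy computation of support and confidence for one association rule,
    # following the definitions above. Column names and values are hypothetical.
    import pandas as pd

    papers = pd.DataFrame({
        "supervised_learning": [True, True, True, False, True, False],
        "irreproducible":      [True, True, False, False, True, True],
    })

    premise = papers["supervised_learning"]
    rule = premise & papers["irreproducible"]

    support = rule.sum() / len(papers)         # fraction of all records where the rule holds
    confidence = rule.sum() / premise.sum()    # fraction of premise records that also satisfy the conclusion
    print(f"support={support:.2f}, confidence={confidence:.2f}")
    ```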

  5. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus: around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered; this restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
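
    The restrictions above amount to a straightforward metadata filter. The sketch below shows one way such a filter could look in Python; the column names (abstract, has_parsed_fulltext, year, field) are assumptions for illustration and do not reflect the actual S2ORC schema.

    ```python
    # Hedged sketch of the corpus restrictions: keep articles with an abstract and
    # parsed full text, published 2000-2020, in one of the included fields of study.
    import pandas as pd

    INCLUDED_FIELDS = {"Economics", "Political Science", "Business",
                       "Sociology", "Medicine", "Psychology"}

    def restrict_corpus(articles: pd.DataFrame) -> pd.DataFrame:
        mask = (
            articles["abstract"].notna()
            & articles["has_parsed_fulltext"]
            & articles["year"].between(2000, 2020)
            & articles["field"].isin(INCLUDED_FIELDS)
        )
        return articles[mask]

    example = pd.DataFrame({
        "abstract": ["...", None, "..."],
        "has_parsed_fulltext": [True, True, True],
        "year": [2015, 2010, 1999],
        "field": ["Economics", "Medicine", "Sociology"],
    })
    print(restrict_corpus(example))  # keeps only the first row
    ```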


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name has a non-standard spelling.


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
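
    The two country-detection approaches can be combined in a few lines. The sketch below is an illustration under stated assumptions: the country list is abbreviated (the project uses ISO 3166 names), the spaCy model choice is ours, and in practice the entities spaCy labels as "GPE" would still need to be matched back to a canonical country list.

    ```python
    # Sketch of country-of-study detection: a regex over a fixed country-name list
    # (approach 1) plus spaCy named entity recognition for geopolitical entities
    # (approach 2). An article is linked to every country found by either approach.
    import re
    import spacy

    COUNTRY_NAMES = ["Kenya", "India", "Brazil", "United States"]   # ISO 3166 names in practice
    COUNTRY_RE = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRY_NAMES)) + r")\b")

    nlp = spacy.load("en_core_web_sm")

    def countries_of_study(text: str) -> set:
        found = set(COUNTRY_RE.findall(text))                                   # regex search
        found |= {ent.text for ent in nlp(text).ents if ent.label_ == "GPE"}    # NER
        return found

    print(countries_of_study("We analyze household survey data from Kenya and Brazil."))
    ```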


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. Identifying whether an academic article is using data from any country

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted, was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to the MTurk workers.


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al., 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, & Wolf, 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
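
    A rough shape of this classification step is sketched below. It is not the authors' released model: it loads the generic distilbert-base-uncased checkpoint from the Hugging Face transformers library (whose untuned classification head outputs placeholder labels), whereas the study fine-tuned DistilBERT in PyTorch on the 900 labelled articles before applying the 90% confidence threshold.

    ```python
    # Hedged sketch: score an abstract with a DistilBERT text-classification pipeline
    # and accept the "uses data" label only at >= 90% confidence. The base checkpoint
    # here is untuned, so LABEL_1 merely stands in for a fine-tuned "uses data" class.
    from transformers import pipeline

    clf = pipeline("text-classification", model="distilbert-base-uncased")

    def uses_data(abstract: str, threshold: float = 0.90) -> bool:
        pred = clf(abstract, truncation=True)[0]      # e.g. {"label": "LABEL_1", "score": 0.93}
        return pred["label"] == "LABEL_1" and pred["score"] >= threshold

    print(uses_data("We estimate poverty rates using 2015 household survey data."))
    ```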


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  6. QADO: An RDF Representation of Question Answering Datasets and their...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 31, 2023
    Cite
    Andreas Both; Oliver Schmidtke; Aleksandr Perevalov (2023). QADO: An RDF Representation of Question Answering Datasets and their Analyses for Improving Reproducibility [Dataset]. http://doi.org/10.6084/m9.figshare.21750029.v3
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Andreas Both; Oliver Schmidtke; Aleksandr Perevalov
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Measuring the quality of Question Answering (QA) systems is a crucial task to validate the results of novel approaches. However, there are already indicators of a reproducibility crisis, as many published systems have used outdated datasets or use subsets of QA benchmarks, making it hard to compare results. We identified the following core problems: there is no standard data format; instead, proprietary data representations are used by the different, partly inconsistent datasets; additionally, the characteristics of datasets are typically not reflected by the dataset maintainers nor by the system publishers. To overcome these problems, we established an ontology---Question Answering Dataset Ontology (QADO)---for representing the QA datasets in RDF. The following datasets were mapped into the ontology: the QALD series, LC-QuAD series, RuBQ series, ComplexWebQuestions, and Mintaka. Hence, the integrated data in QADO covers widely used datasets and multilinguality. Additionally, we did intensive analyses of the datasets to identify their characteristics to make it easier for researchers to identify specific research questions and to select well-defined subsets. The provided resource will enable the research community to improve the quality of their research and support the reproducibility of experiments.

    Here, the mapping results of the QADO process, the SPARQL queries for data analytics, and the archived analytics results file are provided.

    Up-to-date statistics can be created automatically by the script provided at the corresponding QADO GitHub RDFizer repository.

  7. Demographic Traits Annotations

    • kaggle.com
    zip
    Updated Sep 27, 2019
    Cite
    Google Health (2019). Demographic Traits Annotations [Dataset]. https://www.kaggle.com/google-health/demographic-traits-annotations
    Dataset updated
    Sep 27, 2019
    Dataset authored and provided by
    Google Health
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The free-form portions of clinical notes are a significant source of information for research. One path for protecting patients’ privacy is to fully de-identify this information prior to sharing it for research purposes. De-identification efforts have focused on known named entities and other known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note may contain residual “Demographic Traits” (DTs), unique enough to identify the patient when combined with other such facts. While we believe that re-identification is not possible with these demographic traits alone, we hope that giving healthcare organizations the option to remove them will strengthen the privacy standards of automatic de-identification systems and bolster their confidence in such systems.

    More specifically, this dataset was used to test the performance of our paper ‘Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes’. We evaluated our pipeline using a subset of the I2b2 2006 and MIMIC-III datasets.

    Content

    The data contains sentence tagging for MIMIC-III and I2b2 2006 datasets that was used in the paper ‘Interactive Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes’. Every sentence is tagged with its own demographic trait tag (as defined in the "Annotations Guide" file). More formally, the data contains CSV tables each containing rows corresponding to annotated sentences such that every row contains the following example properties: row ID, offset within the note’s text, length and label.

    The label mapping (from character to tag) appears in the "Tagged Categories" file. Furthermore, every note in the MIMIC-III dataset contains a unique row-id (which appears in a field within the note). In I2b2 2006, every note also contains a unique number, referred to as record-id (which also appears within the note). These features can be found in our attached CSVs under the row_id and record_id columns, respectively. In both cases the offset is defined from the beginning of the note's text.
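
    Because every annotation row carries an offset and a length into the note's raw text, recovering the tagged span is a simple slice. The note text, offsets, and trait labels below are fabricated for illustration; the real tag set is defined in the "Annotations Guide" file.

    ```python
    # Illustrative use of the annotation columns: offset and length index into the
    # note text (offsets are counted from the beginning of the note's text).
    import pandas as pd

    note_text = "Patient is a retired firefighter who emigrated from Poland in 1998."
    annotations = pd.DataFrame({
        "row_id": [1, 1],
        "offset": [13, 52],
        "length": [19, 6],
        "label":  ["occupation", "country_of_origin"],   # hypothetical trait tags
    })

    for _, row in annotations.iterrows():
        span = note_text[row["offset"]: row["offset"] + row["length"]]
        print(row["label"], "->", span)
    ```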

  8. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis
    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation
    - Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    - Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics
    - Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    - Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy
    - Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    - Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation
    - Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)
    - Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations
    - Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    - Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
    - Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:
    - Pandas: For data manipulation, grouping, and descriptive analysis.
    - NumPy: For numerical operations such as percentiles and correlations.
    - Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
    - Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
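
    The snippets quoted in the steps above can be consolidated into one small, runnable example. The table below is synthetic; the column names simply follow the case-study description.

    ```python
    # Consolidated sketch of Steps 1-4 on a synthetic salary table.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "department": ["Sales", "Sales", "IT", "IT", "HR", "HR"],
        "years_of_experience": [1, 4, 2, 8, 3, 10],
        "salary": [42000, 55000, 50000, 90000, 45000, 70000],
    })

    df = df.dropna(subset=["salary"]).drop_duplicates()                  # Step 1: cleaning
    print(df["salary"].describe())                                       # Step 2: descriptive statistics
    print(np.percentile(df["salary"], [25, 50, 75]))                     # Step 3: salary quartiles
    print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])    # Step 3: experience vs. salary
    print(df.groupby("department")["salary"].mean())                     # Step 4: average salary per department
    ```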

  9. Data analysis of LiP-MS data for high-throughput applications

    • zenodo.org
    bin, csv, html, xls
    Updated Jan 10, 2023
    Cite
    Valentina Cappelletti; Valentina Cappelletti; Liliana Malinovska; Liliana Malinovska (2023). Data analysis of LiP-MS data for high-throughput applications [Dataset]. http://doi.org/10.5281/zenodo.5749994
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentina Cappelletti; Valentina Cappelletti; Liliana Malinovska; Liliana Malinovska
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Proteins regulate biological processes by changing their structure or abundance to accomplish a specific function. In response to any perturbation or stimulus, protein structure may be altered by a variety of molecular events, such as post-translational modifications, protein-protein interactions, aggregation, allostery, or binding to other molecules. The ability to probe these structural changes in thousands of proteins simultaneously in cells or tissues can provide valuable information about the functional state of a variety of biological processes and pathways. Here we present an updated protocol for LiP-MS, a proteomics technique combining limited proteolysis with mass spectrometry, to detect protein structural alterations in complex backgrounds and on a proteome-wide scale (Cappelletti et al., 2021; Piazza et al., 2020; Schopper et al., 2017). We describe advances in the throughput and robustness of the LiP-MS workflow and implementation of data-independent acquisition (DIA) based mass spectrometry, which together achieve high reproducibility and sensitivity, even on large sample sizes. In addition, we introduce MSstatsLiP, an R package dedicated to the analysis of LiP-MS data for the identification of structurally altered peptides and differentially abundant proteins. Altogether, the newly proposed improvements expand the adaptability of the method and allow for its wide use in systematic functional proteomic studies and translational applications.

  10. Store Sales Data 2022~2023

    • kaggle.com
    zip
    Updated Sep 11, 2024
    Cite
    Ta-wei Lo (2024). Store Sales Data 2022~2023 [Dataset]. https://www.kaggle.com/datasets/taweilo/store-sales-data-20222023
    Dataset updated
    Sep 11, 2024
    Authors
    Ta-wei Lo
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a case study for the company to improve sales

    Business Goal
    Date: 2023/09/15
    Dataset: Sales quantity of a certain brand from January to December 2022 and from January to September 2023.

    Please describe what you observe (no specific presentation format required). Among your observations, identify at least three valuable insights and explain why you consider them valuable.
    If more resources were available to you (including time, information, etc.), what would you need, and what more could you achieve?

    Metadata of the file
    Data Period: January 2022 - September 2023
    Data Fields:
    - item
    - store_id
    - sales of each month

    Sample questions & answers:
    1. Product insights: product sales analysis, such as a BCG matrix
    2. Store insights: identify the sales performance of each store
    3. Supply chain insights: identify the demand
    4. Time series forecasting: identify trend and seasonality

    Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

  11. Quantifying industry spending on promotional events using Open Payments...

    • borealisdata.ca
    • dataone.org
    Updated Jun 27, 2024
    Cite
    Fabian Held (2024). Quantifying industry spending on promotional events using Open Payments data: Event classification script [Dataset]. http://doi.org/10.5683/SP3/0KR09P
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Borealis
    Authors
    Fabian Held
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We conducted a cross-sectional study of the publicly available 2022 Open Payments data to characterize and quantify sponsored events (available for download at: https://www.cms.gov/priorities/key-initiatives/open-payments/data/dataset-downloads).

    Data sources: We downloaded the 2022 dataset ZIP files from the Open Payments website on June 30th, 2023. We included all records for nurse practitioners, clinical nurse specialists, certified registered nurse anesthetists, and certified nurse-midwives (hereafter, advanced practice registered nurses (APRNs)), and for allopathic and osteopathic physicians (hereafter, ‘physicians’). To ensure consistency in provider classification, we linked Payments data to the National Plan and Provider Enumeration System data (June 2023) by National Provider Identifier (NPI) and the National Uniform Claim Committee (NUCC), and excluded individuals with an ambiguous provider type.

    Event-centric analysis of Open Payments records: Creating an event typology. We included only payments classified as “food and beverage” to reliably identify distinct sponsored events. We reasoned that food and beverage would be consumed on the same day in the same place, and thus assumed that records for food and beverage associated with the same event would share the date of payment and location. We also assumed that the reported value of a food and beverage payment is the total cost of the hospitality divided by the number of attendees, and thus grouped payment records with the same amount, rounded to the nearest dollar. Inferring which Open Payments records relate to the same sponsored event requires analytic decisions regarding the selection and representation of variables that define an event. To understand the impact of these choices, we undertook a sensitivity analysis exploring alternative ways to group Open Payments records for food and beverage, to determine how combinations of variables, including date (specific date or within the same calendar week), amount (rounded to the nearest dollar), and recipient’s state, affected the identification of sponsored events in the Open Payments dataset.

    We chose to define a sponsored event as a cluster of three or more individual payment records for food and beverage (nature of payment) with the following matching Open Payments record variables:
    • Submitting applicable manufacturer (name)
    • Product category or therapeutic area
    • Name of drug or biological or device or medical supply
    • Recipient state
    • Total amount of payment (USD, rounded to nearest dollar)
    • Date of payment (exact)

    After examining the distribution of the data, we classified events in terms of size (≥20 attendees as “large” and 3-<20 as “small”) and amount per person. We categorized events <$10 as “coffee”, $10-<$30 as “lunch”, $30-<$150 as “dinner”, and ≥$150 as “banquet”.
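
    The grouping rule above translates naturally into a groupby over the matching variables. The sketch below is an illustration, not the authors' code: the column names are simplified stand-ins for the Open Payments field names, and the product-related fields are collapsed into a single column.

    ```python
    # Hedged sketch of the event typology: food-and-beverage records that share
    # manufacturer, product, recipient state, rounded amount, and exact payment date
    # form one candidate event; clusters of 3+ records are kept and classified by
    # size (>=20 attendees = "large") and amount per person (coffee/lunch/dinner/banquet).
    import pandas as pd

    GROUP_KEYS = ["manufacturer", "product", "recipient_state", "amount_rounded", "payment_date"]

    def identify_events(payments: pd.DataFrame) -> pd.DataFrame:
        fb = payments[payments["nature_of_payment"] == "Food and Beverage"].copy()
        fb["amount_rounded"] = fb["amount_usd"].round(0)
        events = (
            fb.groupby(GROUP_KEYS)
              .agg(attendees=("amount_usd", "size"), amount_per_person=("amount_usd", "mean"))
              .reset_index()
        )
        events = events[events["attendees"] >= 3]
        events["size_class"] = events["attendees"].apply(lambda n: "large" if n >= 20 else "small")
        events["meal_class"] = pd.cut(
            events["amount_per_person"],
            bins=[0, 10, 30, 150, float("inf")],
            labels=["coffee", "lunch", "dinner", "banquet"],
            right=False,
        )
        return events
    ```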

  12. Ecommerce Consumer Behavior Analysis Data

    • kaggle.com
    zip
    Updated Mar 3, 2025
    Cite
    Salahuddin Ahmed (2025). Ecommerce Consumer Behavior Analysis Data [Dataset]. https://www.kaggle.com/datasets/salahuddinahmedshuvo/ecommerce-consumer-behavior-analysis-data
    Dataset updated
    Mar 3, 2025
    Authors
    Salahuddin Ahmed
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive collection of consumer behavior data that can be used for various market research and statistical analyses. It includes information on purchasing patterns, demographics, product preferences, customer satisfaction, and more, making it ideal for market segmentation, predictive modeling, and understanding customer decision-making processes.

    The dataset is designed to help researchers, data scientists, and marketers gain insights into consumer purchasing behavior across a wide range of categories. By analyzing this dataset, users can identify key trends, segment customers, and make data-driven decisions to improve product offerings, marketing strategies, and customer engagement.

    Key Features:
    - Customer Demographics: Understand age, income, gender, and education level for better segmentation and targeted marketing.
    - Purchase Behavior: Includes purchase amount, frequency, category, and channel preferences to assess spending patterns.
    - Customer Loyalty: Features like brand loyalty, engagement with ads, and loyalty program membership provide insights into long-term customer retention.
    - Product Feedback: Customer ratings and satisfaction levels allow for analysis of product quality and customer sentiment.
    - Decision-Making: Time spent on product research, time to decision, and purchase intent reflect how customers make purchasing decisions.
    - Influences on Purchase: Factors such as social media influence, discount sensitivity, and return rates are included to analyze how external factors affect purchasing behavior.

    Columns Overview:
    - Customer_ID: Unique identifier for each customer.
    - Age: Customer's age (integer).
    - Gender: Customer's gender (categorical: Male, Female, Non-binary, Other).
    - Income_Level: Customer's income level (categorical: Low, Middle, High).
    - Marital_Status: Customer's marital status (categorical: Single, Married, Divorced, Widowed).
    - Education_Level: Highest level of education completed (categorical: High School, Bachelor's, Master's, Doctorate).
    - Occupation: Customer's occupation (categorical: various job titles).
    - Location: Customer's location (city, region, or country).
    - Purchase_Category: Category of purchased products (e.g., Electronics, Clothing, Groceries).
    - Purchase_Amount: Amount spent during the purchase (decimal).
    - Frequency_of_Purchase: Number of purchases made per month (integer).
    - Purchase_Channel: The purchase method (categorical: Online, In-Store, Mixed).
    - Brand_Loyalty: Loyalty to brands (1-5 scale).
    - Product_Rating: Rating given by the customer to a purchased product (1-5 scale).
    - Time_Spent_on_Product_Research: Time spent researching a product (integer, hours or minutes).
    - Social_Media_Influence: Influence of social media on the purchasing decision (categorical: High, Medium, Low, None).
    - Discount_Sensitivity: Sensitivity to discounts (categorical: Very Sensitive, Somewhat Sensitive, Not Sensitive).
    - Return_Rate: Percentage of products returned (decimal).
    - Customer_Satisfaction: Overall satisfaction with the purchase (1-10 scale).
    - Engagement_with_Ads: Engagement level with advertisements (categorical: High, Medium, Low, None).
    - Device_Used_for_Shopping: Device used for shopping (categorical: Smartphone, Desktop, Tablet).
    - Payment_Method: Method of payment used for the purchase (categorical: Credit Card, Debit Card, PayPal, Cash, Other).
    - Time_of_Purchase: Timestamp of when the purchase was made (date/time).
    - Discount_Used: Whether the customer used a discount (Boolean: True/False).
    - Customer_Loyalty_Program_Member: Whether the customer is part of a loyalty program (Boolean: True/False).
    - Purchase_Intent: The intent behind the purchase (categorical: Impulsive, Planned, Need-based, Wants-based).
    - Shipping_Preference: Shipping preference (categorical: Standard, Express, No Preference).
    - Payment_Frequency: Frequency of payment (categorical: One-time, Subscription, Installments).
    - Time_to_Decision: Time taken from consideration to actual purchase (in days).

    Use Cases:
    - Market Segmentation: Segment customers based on demographics, preferences, and behavior.
    - Predictive Analytics: Use data to predict customer spending habits, loyalty, and product preferences.
    - Customer Profiling: Build detailed profiles of different consumer segments based on purchase behavior, social media influence, and decision-making patterns.
    - Retail and E-commerce Insights: Analyze purchase channels, payment methods, and shipping preferences to optimize marketing and sales strategies.

    Target Audience:
    - Data scientists and analysts looking for consumer behavior data.
    - Marketers interested in improving customer segmentation and targeting.
    - Researchers exploring factors influencing consumer decisions and preferences.
    - Companies aiming to improve customer experience and increase sales through data-driven decisions.

    This dataset is available in CSV format for easy integration into data analysis tools and platforms such as Python, R, and Excel.

  13. AI Training Dataset In Healthcare Market Analysis, Size, and Forecast...

    • technavio.com
    pdf
    Updated Oct 9, 2025
    Cite
    Technavio (2025). AI Training Dataset In Healthcare Market Analysis, Size, and Forecast 2025-2029 : North America (US, Canada, and Mexico), Europe (Germany, UK, France, Italy, The Netherlands, and Spain), APAC (China, Japan, India, South Korea, Australia, and Indonesia), South America (Brazil, Argentina, and Colombia), Middle East and Africa (UAE, South Africa, and Turkey), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-training-dataset-in-healthcare-market-industry-analysis
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description

    AI Training Dataset In Healthcare Market Size 2025-2029

    The AI training dataset in healthcare market size is forecast to increase by USD 829.0 million, at a CAGR of 23.5%, between 2024 and 2029.

    The global AI training dataset in healthcare market is driven by the expanding integration of artificial intelligence and machine learning across the healthcare and pharmaceutical sectors. This technological shift necessitates high-quality, domain-specific data for applications ranging from AI in medical imaging to clinical operations. A key trend involves the adoption of synthetic data generation, which uses techniques like generative adversarial networks to create realistic, anonymized information. This approach addresses the persistent challenges of data scarcity and stringent patient privacy regulations. The development of applied AI in healthcare is dependent on such innovations to accelerate research timelines and foster more equitable model training. This advancement in AI training dataset creation helps circumvent complex legal frameworks and provides a method for data augmentation, especially for rare diseases. However, the market's progress is constrained by an intricate web of data privacy regulations and security mandates. Navigating compliance with laws like HIPAA and GDPR is a primary operational burden, as the process of de-identification is technically challenging and risks catastrophic compliance failures if re-identification occurs. This regulatory complexity, alongside the need for secure infrastructure for protected health information, acts as a bottleneck, impeding market growth and the broader adoption of AI in patient management and AI in precision medicine.

    What will be the Size of the AI Training Dataset In Healthcare Market during the forecast period?

    Explore in-depth regional segment analysis with market size data (historical 2019-2023 and forecasts 2025-2029) in the full report.
    The market for AI training datasets in healthcare is defined by the continuous need for high-quality, structured information to power sophisticated machine learning algorithms. The development of AI in precision medicine and AI in cancer diagnostics depends on access to diverse and accurately labeled datasets, including digital pathology images and multi-omics data integration. The focus is shifting toward creating regulatory-grade datasets that can support clinical validation and commercialization of AI-driven diagnostic tools. This involves advanced data harmonization techniques and robust AI governance protocols to ensure reliability and safety in all applications.

    Progress in this sector is marked by the evolution from single-modality data to complex multimodal datasets. This shift supports a more holistic analysis required for applications like generative AI in clinical trials and treatment efficacy prediction. Innovations in synthetic data generation and federated learning platforms are addressing key challenges related to patient data privacy and data accessibility. These technologies enable the creation of large-scale, analysis-ready assets while adhering to strict compliance frameworks, supporting the ongoing advancement of applied AI in healthcare and fostering collaborative research environments.

    How is this AI Training Dataset In Healthcare Industry segmented?

    The AI training dataset in healthcare industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.
    - Type: Image, Text, Others
    - Component: Software, Services
    - Application: Medical imaging, Electronic health records, Wearable devices, Telemedicine, Others
    - Geography: North America (US, Canada, Mexico), Europe (Germany, UK, France, Italy, The Netherlands, Spain), APAC (China, Japan, India, South Korea, Australia, Indonesia), South America (Brazil, Argentina, Colombia), Middle East and Africa (UAE, South Africa, Turkey), Rest of World (ROW)

    By Type Insights

    The image segment is estimated to witness significant growth during the forecast period. The image data segment is the most mature and largest component of the market, driven by the central role of imaging in modern diagnostics. This category includes modalities such as radiology images, digital pathology whole-slide images, and ophthalmology scans. The development of computer vision models and other AI models is a key factor, with these algorithms designed to improve the diagnostic capabilities of clinicians. Applications include identifying cancerous lesions, segmenting organs for pre-operative planning, and quantifying disease progression in neurological scans. The market for these datasets is sustained by significant technical and logistical hurdles, including the need for regulatory approval for AI-based medical devices, which elevates the demand for high-quality training datasets. The market'

  14. Albero study: a longitudinal database of the social network and personal...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, csv
    Updated Mar 26, 2021
    Cite
    Isidro Maya Jariego; Isidro Maya Jariego; Daniel Holgado Ramos; Daniel Holgado Ramos; Deniza Alieva; Deniza Alieva (2021). Albero study: a longitudinal database of the social network and personal networks of a cohort of students at the end of high school [Dataset]. http://doi.org/10.5281/zenodo.3532048
    Dataset updated
    Mar 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Isidro Maya Jariego; Isidro Maya Jariego; Daniel Holgado Ramos; Daniel Holgado Ramos; Deniza Alieva; Deniza Alieva
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT

    The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each of the participants in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters of their personal networks are provided, also in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV file (SPSS). The database allows the combined analysis of social networks and personal networks of the same set of individuals.

    INTRODUCTION

    Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.

    The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).

    Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).

    These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This allows the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), to be combined with the in-depth analysis of one specific context (the relationships within a cohort of students at the secondary school), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with individual differences in the structure of personal networks.

    PARTICIPANTS

    The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).

    DATA STRUCTURE AND ARCHIVE FORMATS

    The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.

    Social network

    The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.

    To generate each complete social network, respondents were given the list of 77 students enrolled in the final year of high school and asked to indicate, for each one, the type of relationship according to the following values: 1, “his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have a good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. The network in the second observation is comparatively less dense, reflecting the gradual disintegration process that the student group had begun.
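    As an illustration of how these matrices can be used, here is a minimal loading sketch in Python. Assumptions: pandas, openpyxl and networkx are installed, the Excel sheet is a square actor-by-actor matrix with actor labels in the first row and column, and tie direction is ignored; only the file name comes from the description above.

```python
import pandas as pd
import networkx as nx

# Valued actor-by-actor matrix (0 = no tie, 1-5 = relationship strength).
adj = pd.read_excel("Red_Social_t1.xlsx", index_col=0)

# Nonzero cells become edges carrying the 1-5 value as a "weight" attribute.
G = nx.from_pandas_adjacency(adj)

# Keep only "we have a good relationship" (4) and "we are friends" (5) ties.
friends = nx.Graph()
friends.add_nodes_from(G.nodes)
friends.add_edges_from(
    (u, v, d) for u, v, d in G.edges(data=True) if d["weight"] >= 4
)

print(f"actors: {G.number_of_nodes()}, overall density: {nx.density(G):.3f}")
print(f"friendship ties (weight >= 4): {friends.number_of_edges()}")
```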

    Personal networks

    Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).

    Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.

    Sense of community and metropolitan displacements

    The SPSS file “Albero.sav” collects the survey data, together with some information-summary of the network data related to each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:

    • Socio-economic data.

    • Data on habitual residence.

    • Information on intercity journeys.

    • Identity and sense of community.

    • Personal network indicators.

    • Social network indicators.
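    Given the layout just described (69 rows, one per respondent, and 118 T1/T2 variables), a minimal sketch for reading "Albero.sav" into Python follows; the pyreadstat package is an assumption, and any SPSS reader would work equally well.

```python
import pyreadstat

# Read the survey data together with the variable names and labels stored
# in the SPSS metadata.
df, meta = pyreadstat.read_sav("Albero.sav")

print(df.shape)                 # expected: (69, 118) per the description above
print(meta.column_names[:10])   # first few variable names
print(meta.column_labels[:10])  # corresponding SPSS labels, if present
```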

    DATA ACCESS

    Social networks and personal networks are available in CSV format. This allows its use directly with UCINET, Visone, Pajek or Gephi, among others, and they can be exported as Excel or text format files, to be used with other programs.
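    As a concrete illustration of that workflow, here is a minimal sketch that loads one personal-network matrix and re-exports it in Pajek format; the path "Redes_Personales_t1/respondent_01.csv" is hypothetical, standing in for any of the respondent folders, whose matrices use the 0-2 values described above.

```python
import pandas as pd
import networkx as nx

# Alter-by-alter valued matrix for one respondent
# (0 = do not know each other, 2 = know each other very well).
alters = pd.read_csv("Redes_Personales_t1/respondent_01.csv", index_col=0)
P = nx.from_pandas_adjacency(alters)

# A couple of simple structural indicators of the personal network.
print("density:", round(nx.density(P), 3))
print("connected components:", nx.number_connected_components(P))

# Re-export in Pajek format, readable by Pajek and importable into Gephi or UCINET.
nx.write_pajek(P, "respondent_01.net")
```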

    The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: <https://www.flickr.com/photos/25906481@N07/albums/72157667029974755>.

    In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network; that article also includes additional details about the instruments applied. If you use the data, please cite the following reference:

    • Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp

    The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl

    CONCLUSION

    The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.

    The longitudinal design, and the combination of individuals' personal networks with a shared complete social network, give this database unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.

    ACKNOWLEDGEMENTS

    The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals,

  15. H

    Data for Optical Character Recognition Applied to Hieratic: Sign...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Julius A. Tabin (2023). Data for Optical Character Recognition Applied to Hieratic: Sign Identification and Broad Analysis [Dataset]. http://doi.org/10.7910/DVN/D8CWVZ
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Julius A. Tabin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data consists of a number of .zip files containing everything needed to run the hieratic optical character recognition program presented at https://github.com/jtabin/PaPYrus. The files included are as follows:

    1. "Dataset By Sign": All 13,134 data set images, categorized in folders by their Gardiner sign. Each image is a black and white .png of a hieratic sign. The signs are labeled with unique identifiers corresponding, in order, to their placement in a text from the 1st (0001) to the 9999th (9999), facsimile maker (1 for Möller, 2 for Poe, 3 for Tabin), provenance (1: Thebes, 2: Lahun, 3: Hatnub, 4: Unknown), and original text (1: Shipwrecked Sailor, 2: Eloquent Peasant B1, 3: Eloquent Peasant R, 4: Sinuhe B, 5: Sinuhe R, 6: Papyrus Prisse, 7: Hymn to Senwosret III, 8: Lahun Temple Files, 9: Will of Wah, 10: Texte aus Hatnub, 11: Papyrus Ebers, 12: Rhind Papyrus, 13: Papyrus Westcar).
    2. "Dataset Categorized": Every data set image, as above, categorized in folders by provenance, text, and facsimile maker (i.e. where the tags originate from).
    3. "Dataset Whole": Every data set image in one folder. This is what is used for the analyses done by the OCR program.
    4. "Precalculated Data Set Stats": A collection of .csv files outputted by the "Data Set Prep.ipynb" code (found on the aforementioned GitHub page). "pxls_16.csv", "pxls_20.csv", and "pxls_25.csv" contain the pixel values for every sign in the data set after resizing to 16, 20, and 25 pixels, respectively. "datasetstats.csv" includes the aspect ratios and sign names for every sign in the data set. The two files beginning with "A1cut" are the same stats, but after every A1 sign had its tail manually cut off.
    5. "Precalculated OCR Results": A collection of .csv files outputted by the "Image Identification.ipynb" code (also found on the GitHub page). Most files are the product of running every instance of one sign from the data set through the OCR program and are labeled with the name of that sign; they contain columns of signs and their similarity scores when compared to other signs. Some files, such as "randsamp_fullresults.csv", come from other analyses explained in their file names (that file, for instance, is a random sample from the data set).
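    Because the tag values listed above are numeric codes, a small decoding helper can make them human-readable. The sketch below is only illustrative: the description does not specify how the four fields are delimited within each file name, so the function assumes they have already been split out.

```python
# Code tables transcribed from the description above.
FACSIMILE = {1: "Möller", 2: "Poe", 3: "Tabin"}
PROVENANCE = {1: "Thebes", 2: "Lahun", 3: "Hatnub", 4: "Unknown"}
TEXT = {
    1: "Shipwrecked Sailor", 2: "Eloquent Peasant B1", 3: "Eloquent Peasant R",
    4: "Sinuhe B", 5: "Sinuhe R", 6: "Papyrus Prisse", 7: "Hymn to Senwosret III",
    8: "Lahun Temple Files", 9: "Will of Wah", 10: "Texte aus Hatnub",
    11: "Papyrus Ebers", 12: "Rhind Papyrus", 13: "Papyrus Westcar",
}

def decode_tag(placement: int, facsimile: int, provenance: int, text: int) -> dict:
    """Translate the four numeric tag fields into readable labels."""
    return {
        "placement_in_text": placement,            # 0001-9999
        "facsimile_maker": FACSIMILE[facsimile],
        "provenance": PROVENANCE[provenance],
        "original_text": TEXT[text],
    }

# Example: the 42nd sign of a text, facsimile by Möller, from Thebes, Shipwrecked Sailor.
print(decode_tag(42, 1, 1, 1))
```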

  16. Data from: Who shares? Who doesn't? Factors associated with openly archiving...

    • zenodo.org
    bin, csv, txt
    Updated Jun 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Heather A. Piwowar; Heather A. Piwowar (2022). Data from: Who shares? Who doesn't? Factors associated with openly archiving raw research data [Dataset]. http://doi.org/10.5061/dryad.mf1sd
    Explore at:
    Available download formats: csv, bin, txt
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Heather A. Piwowar; Heather A. Piwowar
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%-35% in 2007-2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.

  17. Gamelytics: Mobile Analytics Challenge

    • kaggle.com
    zip
    Updated Feb 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    letocen (2025). Gamelytics: Mobile Analytics Challenge [Dataset]. https://www.kaggle.com/datasets/debs2x/gamelytics-mobile-analytics-challenge
    Explore at:
    Available download formats: zip (66154620 bytes)
    Dataset updated
    Feb 16, 2025
    Authors
    letocen
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Gamelytics: Mobile Analytics Challenge 🎮📊

    Subtitle

    Unlock key insights into player behavior, optimize game metrics, and make data-driven decisions!

    Description

    Welcome to the Gamelytics: Mobile Analytics Challenge, a real-world-inspired dataset designed for data enthusiasts eager to dive deep into mobile game analytics. This dataset challenges you to analyze player behavior, evaluate A/B test results, and develop metrics for assessing game event performance.

    Project Context & Tasks

    Task 1: Retention Analysis

    🔍 Objective: Calculate the daily retention rate of players, starting from their registration date.
    📄 Data Sources:
    - reg_data.csv: Contains user registration timestamps (reg_ts) and unique user IDs (uid).
    - auth_data.csv: Contains user login timestamps (auth_ts) and unique user IDs (uid).
    💡 Challenge: Develop a Python function to calculate retention, allowing you to test its performance on both the complete dataset and smaller samples.
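    One minimal way such a function could look, assuming pandas, the column layout given in the dataset information below (reg_ts and auth_ts as Unix timestamps in seconds), and day-N retention defined as the share of registered users who log in N whole days after registering:

```python
import pandas as pd

def daily_retention(reg_path: str, auth_path: str) -> pd.Series:
    """Share of registered users who log in N whole days after registration."""
    reg = pd.read_csv(reg_path)    # columns: reg_ts (Unix seconds), uid
    auth = pd.read_csv(auth_path)  # columns: auth_ts (Unix seconds), uid

    # Attach each login to that user's registration time.
    events = auth.merge(reg, on="uid", how="inner")
    events["day"] = (events["auth_ts"] - events["reg_ts"]) // 86_400

    cohort_size = reg["uid"].nunique()
    active_by_day = events.groupby("day")["uid"].nunique()
    return (active_by_day / cohort_size).rename("retention_rate")

# Works identically on the full files or on smaller samples of them.
print(daily_retention("reg_data.csv", "auth_data.csv").head(8))
```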

    Task 2: A/B Testing for Promotional Offers

    🔍 Objective: Identify the best-performing promotional offer set by comparing key revenue metrics.
    💰 Context:
    - The test group has a 5% higher ARPU than the control group.
    - In the control group, 1928 users out of 202,103 are paying customers.
    - In the test group, 1805 users out of 202,667 are paying customers.
    📊 Data Sources:
    - ab_test.csv: Includes user_id, revenue, and testgroup columns.
    💡 Challenge: Decide which offer set performs best, and determine the appropriate metrics for a robust evaluation.
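    A minimal sketch of how the quoted conversion figures could be compared, using a plain two-proportion z-test under a normal approximation; whether conversion rate, ARPU, or ARPPU is the right headline metric is exactly the judgment call the task asks you to make.

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_a, p_b, (p_a - p_b) / se

# Figures quoted in the task context above.
p_control, p_test, z = two_proportion_z(1928, 202_103, 1805, 202_667)
print(f"control conversion: {p_control:.3%}")
print(f"test conversion:    {p_test:.3%}")
print(f"z-statistic:        {z:.2f}")
```

    A higher ARPU in the test group alongside a lower share of paying users would suggest that revenue is concentrated in fewer, higher-spending players, which is worth checking against the revenue distribution in ab_test.csv before declaring a winner.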

    Task 3: Event Performance Evaluation in "Plants & Gardens"

    🔍 Objective: Develop metrics to assess the success of a time-limited in-game event where players can earn unique rewards.
    🍃 Context: Players complete levels to win exclusive items, bonuses, or coins. In a variation, players may be penalized (sent back levels) after failed attempts.
    💡 Challenge: Define how metrics should change under the penalty variation and identify KPIs for evaluating event success.

    Dataset Information

    The provided data is split into three files, each detailing a specific aspect of the application. Here's a breakdown:

    1. User Registration Data (reg_data.csv)

    • Records: 1,000,000
    • Columns:
      • reg_ts: Registration time (Unix time, int64)
      • uid: Unique user ID (int64)
    • Memory Usage: 15.3 MB
    • Description: This dataset contains user registration timestamps and IDs. It is clean and contains no missing data.

    2. User Activity Data (auth_data.csv)

    • Records: 9,601,013
    • Columns:
      • auth_ts: Login time (Unix time, int64)
      • uid: Unique user ID (int64)
    • Memory Usage: 146.5 MB
    • Description: This dataset captures user login timestamps and IDs. It is clean and contains no missing data.

    3. A/B Testing Data (ab_test.csv)

    • Records: 404,770
    • Columns:
      • user_id: Unique user ID (int64)
      • revenue: Revenue (int64)
      • testgroup: Test group (object)
    • Memory Usage: ~9.3 MB
    • Description: This dataset provides insights into A/B test results, including revenue and group allocation for each user. It is clean and ready for analysis.

    Inspiration & Benefits

    • Real-World Relevance: Inspired by actual challenges in mobile gaming analytics, this dataset lets you solve meaningful problems.
    • Diverse Data Types: Work with registration logs, activity timestamps, and experimental results to gain a holistic understanding of mobile game data.
    • Skill Building: Perfect for those honing skills in retention analysis, A/B testing, and event-based performance evaluation.
    • Community Driven: Built to inspire collaboration and innovation in the data analytics community. 🚀

    Whether you’re a beginner or an expert, this dataset offers an engaging challenge to sharpen your analytical skills and drive actionable insights. Happy analyzing! 🎉📈

  18. Storage Area Network (San) Market Analysis North America, Europe, APAC,...

    • technavio.com
    pdf
    Updated Aug 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Technavio (2024). Storage Area Network (San) Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, Canada, Germany, China - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/storage-area-network-san-market-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2024 - 2028
    Description

    Snapshot img

    Storage Area Network (SAN) Market Size 2024-2028

    The storage area network (san) market size is forecast to increase by USD 35.46 billion, at a CAGR of 16.83% between 2023 and 2028.

    The market is experiencing significant growth, driven primarily by the increasing need for data backup and redundancy in the context of digital transformation. Businesses are increasingly adopting digital strategies, leading to an explosion of data. SAN technology offers a scalable, flexible, and high-performance solution for managing this data, making it an essential component of modern IT infrastructure. However, this market is not without challenges. Cybersecurity threats pose a significant obstacle, with SANs being a prime target due to their critical role in data management. Ensuring the security of SANs is a top priority for organizations, requiring significant investment in cybersecurity solutions and best practices. Additionally, the complexity of SANs can make implementation and management challenging, necessitating specialized expertise and resources. Companies seeking to capitalize on the opportunities presented by the SAN market must navigate these challenges effectively, investing in robust security measures and building a skilled workforce.

    What will be the Size of the Storage Area Network (SAN) Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2018-2022 and forecasts 2024-2028 - in the full report.
    The market continues to evolve, with dynamic market activities unfolding across various sectors. Hybrid cloud storage solutions are increasingly adopted, integrating SAN with cloud storage for enhanced performance and flexibility. Data management remains a key focus, with de-identification, backup, and retention strategies being continually refined. Software-defined storage and data deduplication are transforming the landscape, enabling optimization of data center infrastructure. Multi-cloud storage and performance tuning are also gaining traction, allowing businesses to manage and distribute data more efficiently. Network Attached Storage (NAS) and Object Storage are complementing SAN, offering different access methods and use cases. Data compression and archiving are essential for capacity planning and cost optimization. Security remains a top priority, with encryption, masking, and compliance measures being implemented to protect sensitive data. Disaster recovery and data governance are crucial components of a robust data management strategy. File and block level storage, as well as flash storage, offer varying benefits depending on the application. Storage analytics, auditing, and tiered storage solutions provide valuable insights for capacity planning and performance monitoring. Fibre Channel and Ethernet technologies continue to shape the market, while hyperconverged infrastructure and SAN switches offer streamlined management and consolidation. The ongoing evolution of the SAN market is driven by the continuous pursuit of improved performance, cost savings, and enhanced security.

    How is this Storage Area Network (SAN) Industry segmented?

    The storage area network (san) industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Component: Hardware, Software, Services
    Technology: Fiber channel, Fiber channel over ethernet, Infiniband, iSCSI protocol
    Geography: North America (US, Canada), Europe (Germany, UK), APAC (China), Rest of World (ROW)

    By Component Insights

    The hardware segment is estimated to witness significant growth during the forecast period. The market encompasses hardware, software, and services that interconnect storage devices and servers. Hardware components, a crucial part of this infrastructure, consist of fiber channels and related hardware such as hubs, switches, gateways, directors, and routers. The market's growth is driven by the escalating demand for data backup and high-speed networking. Additionally, the ongoing digital transformation worldwide is anticipated to significantly boost the hardware segment's expansion. Data backup and disaster recovery are essential functions in today's business environment, necessitating efficient and reliable storage solutions. Software-defined storage, data deduplication, compression, and tiered storage are among the advanced technologies enhancing data backup and recovery capabilities. Furthermore, data security, compliance, and governance are critical concerns, leading to the adoption of data encryption, masking, and access control mechanisms. Network Attached Storage (NAS) and Cloud Storage offer alternative storage architectures to SAN, and multi-cloud storage and hybrid cloud strategies are gaining traction, necessitating seamless integration and optimization.

  19. f

    Agricultural Census, 2010 - Poland

    • microdata.fao.org
    Updated Jan 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Central Statistical Office (CSO) (2021). Agricultural Census, 2010 - Poland [Dataset]. https://microdata.fao.org/index.php/catalog/1706
    Explore at:
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    Central Statistical Office (CSO)
    Time period covered
    2010
    Area covered
    Poland
    Description

    Abstract

    The agricultural census and the survey on agricultural production methods were conducted jointly, i.e. within the same organisational structure, at the same time, and using a single electronic questionnaire and the same methods of data collection and processing. The agricultural census covered about 1.8 million agricultural holdings. At all farms participating in the census, respondents were asked about the "other gainful activities carried out by the labour force" (OGA). The frame for the full survey was prepared on the basis of the list of holdings prepared for the census. When creating the list, an object-oriented approach was adopted for the first time, which meant that at the first stage the holdings (objects) were identified, their coordinates defined (they were located spatially) and their holders were identified on the basis of data from administrative sources. For domestic purposes, the farms with the smallest area, as well as those of little economic importance (meeting very low national thresholds), were included in the sample survey carried out jointly with the census. The survey on agricultural production methods was conducted on a sample of approximately 200 thousand farms, in line with the precision requirements set out in Regulation (EC) 1166/2008. The frame prepared for the agricultural census was used as the sampling frame.

    Geographic coverage

    National coverage

    Analysis unit

    Households

    Universe

    The statistical unit was the agricultural holding, defined as "an agricultural area, including forest land, buildings or their parts, equipment and stock if they constitute or may constitute an organized economic unit as well as rights related to running the farm". Two types of holding were distinguished (i) the natural persons' holdings (to which thresholds were applied) and (ii) legal persons holdings (no threshold applied).

    Kind of data

    Census/enumeration data [cen]

    Sampling procedure

    (a) Frame. The frame for the agricultural census and the survey on agricultural production methods was based on the list of agricultural holdings. In creating the list of farms for the AC and SAPM 2010, an object-oriented approach was used for the first time: at the first stage, agricultural holdings were identified, their coordinates were defined (farms were located in space), and their holders were determined from administrative data, as described below. List creation started with the identification of all land parcels used for agricultural purposes. Land parcels found in the records of the Agency for Restructuring and Modernisation of Agriculture (including the Records of Holdings and Records of Producers) were combined into holdings and had their holders defined. For the remaining land parcels, holders were identified from the Records of Land and Buildings, and the data on users were then updated using the Real Property Tax Records.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    A single electronic questionnaire was used for data collection, combining information related to both the AC 2010 and the SAPM. The census covered all 16 core items recommended in the WCA 2010.

    Questionnaire:

    Section 0. Identifying characters
    Section 1. Land use
    Section 2. Economic activity
    Section 3. Income structure
    Section 4. Sown and other area
    Section 5. Livestock
    Section 6. Tractor, machines and equipment
    Section 7. Use of fertilizers
    Section 8. Labour force
    Section 9. Agricultural production methods

    Cleaning operations

    a. DATA PROCESSING AND ARCHIVING

    The data captured through the CAPI, CATI and CAWI channels were gathered in the Operational Microdata Base (OMB) built for the AC 2010 and processed there (including control and correction of data, as well as completing the file obtained in the AC with the data obtained from administrative sources, imputed units and estimation for the SAPM). The data, depersonalized and validated in the OMB, were exported to an Analytical Microdata Base (AMB) to conduct analyses, prepare the data set for transmission to Eurostat and develop multidimensional tables for internal and external users.

    b. CENSUS DATA QUALITY

    Except for a few isolated cases, the CAPI and CATI methods resulted in fully completed questionnaires. The computer applications used enabled controls for completeness and correctness of the data already at the collection stage, also facilitating the use of necessary definitions and clarifications during the questionnaire completion process. A set of detailed questionnaire completion guidelines was developed and delivered during training sessions.

    Data appraisal

    The preliminary results of the agricultural census were published in February 2011 (basic data at the national level), and then in July 2011 in the publication entitled "Report on the Results of the 2010 Agricultural Census" (in a broader thematic scope, at NUTS3 2 level). The final results of the AC 2010 were disseminated by a sequence of publications, covering the main thematic areas of the census. The reference publications were released in paper form, and are available online (www.stat.gov.pl), and on CD-ROMs.

  20. d

    Data from: Database used for the evaluation of data used to identify...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Database used for the evaluation of data used to identify groundwater sources under the direct influence of surface water in Pennsylvania [Dataset]. https://catalog.data.gov/dataset/database-used-for-the-evaluation-of-data-used-to-identify-groundwater-sources-under-the-di
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Pennsylvania
    Description

    The U.S. Geological Survey (USGS), in cooperation with the Pennsylvania Department of Environmental Protection (PADEP), conducted an evaluation of data used by the PADEP to identify groundwater sources under the direct influence of surface water (GUDI) in Pennsylvania (Gross and others, 2022). The data used in this evaluation and the processes used to compile them from multiple sources are described and provided herein. Data were compiled primarily but not exclusively from PADEP resources, including (1) source-information for public water-supply systems and Microscopic Particulate Analysis (MPA) results for public water-supply system groundwater sources from the agency’s Pennsylvania Drinking Water Information System (PADWIS) database (Pennsylvania Department of Environmental Protection, 2016), and (2) results associated with MPA testing from the PADEP Bureau of Laboratories (BOL) files and water-quality analyses obtained from the PADEP BOL, Sample Information System (Pennsylvania Department of Environmental Protection, written commun., various dates). Information compiled from sources other than the PADEP includes anthropogenic (land cover and PADEP region) and naturogenic (geologic and physiographic, hydrologic, soil characterization, and topographic) spatial data. Quality control (QC) procedures were applied to the PADWIS database to verify spatial coordinates, verify collection type information, exclude sources not designated as wells, and verify or remove values that were either obvious errors or populated as zero rather than as “no data.” The QC process reduced the original PADWIS dataset to 12,147 public water-supply system wells (hereafter referred to as the PADWIS database). An initial subset of the PADWIS database, termed the PADWIS database subset, was created to include 4,018 public water-supply system community wells that have undergone the Surface Water Identification Protocol (SWIP), a protocol used by the PADEP to classify sources as GUDI or non-GUDI (Gross and others, 2022). A second subset of the PADWIS database, termed the MPA database subset, represents MPA results for 631 community and noncommunity wells and includes water-quality data (alkalinity, chloride, Escherichia coli, fecal coliform, nitrate, pH, sodium, specific conductance, sulfate, total coliform, total dissolved solids, total residue, and turbidity) associated with groundwater-quality samples typically collected concurrently with the MPA sample. The PADWIS database and two subsets (PADWIS database subset and MPA database subset) are compiled in a single data table (DR_2022_Table.xlsx), with the two subsets differentiated using attributes that are defined in the associated metadata table (DR_2022_Metadata_Table_Variables.xlsx). This metadata file (DR_2022_Metadata.xml) describes data resources, data compilation, and QC procedures in greater detail.
