100+ datasets found
  1. Data and Code for: Deep Learning for Economists

    • openicpsr.org
    delimited
    Updated Nov 13, 2024
    Cite
    Melissa Dell (2024). Data and Code for: Deep Learning for Economists [Dataset]. http://doi.org/10.3886/E210922V1
    Available download formats: delimited
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    American Economic Association
    Authors
    Melissa Dell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1877 - 2012
    Area covered
    United Kingdom, United States
    Description

    Deep learning provides powerful methods to impute structured information from large-scale, unstructured text and image datasets. For example, economists might wish to detect the presence of economic activity in satellite images, or to measure the topics or entities mentioned in social media, the congressional record, or firm filings. This review introduces deep neural networks, covering methods such as classifiers, regression models, generative AI, and embedding models. Applications include classification, document digitization, record linkage, and methods for data exploration in massive-scale text and image corpora. When suitable methods are used, deep learning models can be cheap to tune and can scale affordably to problems involving millions or billions of data points. The review is accompanied by a regularly updated companion website, EconDL (https://econdl.github.io/), with user-friendly demo notebooks, software resources, and a knowledge base that provides technical details and additional applications.

  2. Data and code files for co-occurrence modeling project

    • catalog.data.gov
    • datadiscoverystudio.org
    • +2more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Data and code files for co-occurrence modeling project [Dataset]. https://catalog.data.gov/dataset/data-and-code-files-for-co-occurrence-modeling-project
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Files included are original data inputs on stream fishes (fish_data_OEPA_2012.csv), water chemistry (OEPA_WATER_2012.csv), geographic data (NHD_Plus_StreamCat); modeling files for generating predictions from the original data, including the R code (MVP_R_Final.txt) and Stan code (MV_Probit_Stan_Final.txt); and the model output file containing predictions for all NHDPlus catchments in the East Fork Little Miami River watershed (MVP_EFLMR_cooc_Final). This dataset is associated with the following publication: Martin, R., E. Waits, and C. Nietch. Empirically-based modeling and mapping to consider the co-occurrence of ecological receptors and stressors. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 613-614: 1228-1239, (2018).

  3. Code-Preference-Pairs

    • huggingface.co
    Updated Jul 28, 2024
    + more versions
    Cite
    Code-Preference-Pairs [Dataset]. https://huggingface.co/datasets/Vezora/Code-Preference-Pairs
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2024
    Authors
    Vezora
    License

    https://choosealicense.com/licenses/other/

    Description

    Creator: Nicolas Mejia-Petit

      Code-Preference-Pairs Dataset

      Overview
    

    This dataset was created while creating Open-Critic-GPT. Here is a brief overview: the Open-Critic-GPT dataset is a synthetic dataset created to train models in both identifying and fixing bugs in code. The dataset is generated using a unique synthetic data pipeline which involves:

    • Prompting a local model with an existing code example.
    • Introducing bugs into the code.
    • While also having the model… See the full description on the dataset page: https://huggingface.co/datasets/Vezora/Code-Preference-Pairs.
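    The pairing structure behind such a dataset can be sketched in a few lines of Python. This is illustrative only: `introduce_bug` and the field names `chosen`/`rejected` are assumptions for the sketch, not the dataset's documented schema, and a real pipeline would use a model, not a string replacement, to introduce bugs.

    ```python
    def introduce_bug(code: str) -> str:
        """Naive bug injector: turn the first '<' comparison into '<=' (an off-by-one)."""
        return code.replace(" < ", " <= ", 1)

    def make_preference_pair(code: str) -> dict:
        """Pair the original code (preferred) with a buggy variant (rejected)."""
        return {"chosen": code, "rejected": introduce_bug(code)}

    pair = make_preference_pair("if i < n:\n    total += a[i]")
    assert pair["chosen"] != pair["rejected"]
    ```

    A preference-tuning method (e.g. DPO) would then train the model to prefer the "chosen" completion over the "rejected" one.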

  4. Data to Replicate paper Improving Bug Detection via Context-based Code...

    • zenodo.org
    bin
    Updated Mar 21, 2020
    Cite
    Li Yi; Li Yi (2020). Data to Replicate paper Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks part 2 [Dataset]. http://doi.org/10.5281/zenodo.3719225
    Available download formats: bin
    Dataset updated
    Mar 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Li Yi; Li Yi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data to Replicate paper "Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks" part 2.

    The author of the paper uploaded the dataset to Google Drive; these are the same files, uploaded to Zenodo. Since detection_data.tar.gz exceeded Zenodo's size limit, I split the data into two parts. This is the first part. Splitting was achieved on OS X with:

    split -b 31000m "detection_data.tar.gz" "detection_data.tar.gz."

    To get original file back, run

    cat detection_data.tar.gz.* > detection_data.tar.gz
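    The split/cat round-trip above can be sanity-checked in miniature. A Python sketch follows, with a toy byte string standing in for the real archive; `split_bytes` and `cat_bytes` are illustrative names, not part of the dataset:

    ```python
    import hashlib

    def split_bytes(data: bytes, chunk_size: int) -> list[bytes]:
        """Mimic `split -b SIZE`: cut the byte stream into fixed-size chunks."""
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def cat_bytes(parts: list[bytes]) -> bytes:
        """Mimic `cat file.* > file`: concatenate the chunks in order."""
        return b"".join(parts)

    # Round-trip a ~100 kB dummy payload and compare checksums,
    # as one might after reassembling detection_data.tar.gz.
    payload = bytes(range(256)) * 400
    parts = split_bytes(payload, 30_000)
    restored = cat_bytes(parts)
    assert hashlib.sha256(restored).digest() == hashlib.sha256(payload).digest()
    ```

    Note that `cat detection_data.tar.gz.*` relies on the shell expanding the glob in lexicographic order, which matches the suffix order `split` generates.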

    GitHub link to the project: https://github.com/OOPSLA-2019-BugDetection/OOPSLA-2019-BugDetection

  5. CPRD codes: ICD-10 equivalent code lists for dementia subtypes - Datasets -...

    • data.bris.ac.uk
    Updated Dec 11, 2017
    + more versions
    Cite
    (2017). CPRD codes: ICD-10 equivalent code lists for dementia subtypes - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/2h4rmk9v7pw2k23h7vgf9tx1ea
    Dataset updated
    Dec 11, 2017
    Description

    This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2

  6. Data and code underlying the publication: DCAST: Diverse Class-Aware...

    • data.4tu.nl
    zip
    Updated Nov 25, 2024
    Cite
    Yasin Tepeli; Joana Gonçalves (2024). Data and code underlying the publication: DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning [Dataset]. http://doi.org/10.4121/8648064e-aa7b-4a09-a755-7eb2d90bef66.v1
    Available download formats: zip
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Yasin Tepeli; Joana Gonçalves
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This repository consists of Data/Code to reproduce the results of the thesis chapter "DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning".

    The data is shared at: https://doi.org/10.6084/m9.figshare.27003601

    The code is shared at: https://github.com/joanagoncalveslab/DCAST

  7. Data and Code for: Correlation Neglect in Student-to-School Matching

    • openicpsr.org
    delimited
    Updated Jun 6, 2023
    Cite
    Alex Rees-Jones; Ran Shorrer; Chloe Tergiman (2023). Data and Code for: Correlation Neglect in Student-to-School Matching [Dataset]. http://doi.org/10.3886/E192088V1
    Available download formats: delimited
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    American Economic Association
    Authors
    Alex Rees-Jones; Ran Shorrer; Chloe Tergiman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2019 - 2022
    Area covered
    United States
    Description

    Data and code to accompany the paper "Correlation Neglect in Student-to-School Matching." Abstract: We present results from three experiments containing incentivized school-choice scenarios. In these scenarios, we vary whether schools' assessments of students are based on a common priority (inducing correlation in admissions decisions) or on independent assessments (eliminating correlation in admissions decisions). The quality of students' application strategies declines in the presence of correlated admissions: application strategies become substantially more aggressive and fail to include attractive 'safety' options. We provide a battery of tests suggesting that this phenomenon is at least partially driven by correlation neglect, and we discuss implications for the design and deployment of student-to-school matching mechanisms.

  8. synthetic-data-code-ir

    • huggingface.co
    Updated Apr 21, 2025
    Cite
    Susnato Dhar (2025). synthetic-data-code-ir [Dataset]. https://huggingface.co/datasets/susnato/synthetic-data-code-ir
    Dataset updated
    Apr 21, 2025
    Authors
    Susnato Dhar
    Description

    susnato/synthetic-data-code-ir dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Code Cases by Officer

    • data.mesaaz.gov
    • citydata.mesaaz.gov
    application/rdfxml +5
    Updated Jul 7, 2025
    Cite
    Code Cases by Officer [Dataset]. https://data.mesaaz.gov/Code-Compliance/Code-Cases-by-Officer/akj6-u9kt
    Available download formats: csv, application/rdfxml, tsv, application/rssxml, json, xml
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Code Compliance
    Description

    This dataset is a subset only of Code Enforcement information and is used by the Performance Measure "Code Cases per Code Officer". Some fields are calculated or derived. Full "Code Enforcement" dataset = https://citydata.mesaaz.gov/dataset/Code-Enforcement/hgf6-yenu

  10. pii-comp

    • kaggle.com
    zip
    Updated Apr 18, 2024
    Cite
    Devin Anzelmo (2024). pii-comp [Dataset]. https://www.kaggle.com/datasets/devinanzelmo/pii-comp
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 18, 2024
    Authors
    Devin Anzelmo
    Description
  11. Data from: Data and code from: Topographic wetness index as a proxy for soil...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-topographic-wetness-index-as-a-proxy-for-soil-moisture-in-a-hillslope-c-e5e42
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all data and code necessary to reproduce the analysis presented in the manuscript: Winzeler, H.E., Owens, P.R., Read, Q.D., Libohova, Z., Ashworth, A., Sauer, T. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018.

    There are several steps to this analysis; the relevant scripts for each are listed below. The first step uses the raw digital elevation model (DEM) to produce different versions of the topographic wetness index (TWI) for the study region (Calculating TWI). These TWI output files are then processed, along with soil moisture (volumetric water content, VWC) time series data from a number of sensors located within the study region, to create analysis-ready data objects (Processing TWI and VWC). Next, models are fit relating TWI to soil moisture (Model fitting) and results are plotted (Visualizing main results). A number of additional analyses were also done (Additional analyses).

    Input data
    The DEM of the study region is archived in this dataset as SourceDem.zip. This contains the DEM of the study region (DEM1.sgrd) and associated auxiliary files, all called DEM1.* with different extensions. In addition, the DEM is provided as a .tif file called USGS_one_meter_x39y400_AR_R6_WashingtonCO_2015.tif. The remaining data and code files are archived in the repository created with a GitHub release on 2022-10-11, twi-moisture-0.1.zip. The data are found in a subfolder called data:
    • 2017_LoggerData_HEW.csv through 2021_HEW.csv: soil moisture (VWC) logger data for each year 2017-2021 (5 files total).
    • 2882174.csv: weather data from a nearby station.
    • DryPeriods2017-2021.csv: starting and ending days for dry periods 2017-2021.
    • LoggerLocations.csv: geographic locations and metadata for each VWC logger.
    • Logger_Locations_TWI_2017-2021.xlsx: 546 topographic wetness indexes calculated at each VWC logger location. Note: this is intermediate input created in the first step of the pipeline.

    Code pipeline
    To reproduce the analysis in the manuscript, run these scripts in the following order. The scripts are all found in the root directory of the repository. See the manuscript for more details on the methods.

    Calculating TWI
    • TerrainAnalysis.R: taking the DEM file as input, calculates 546 different topographic wetness indexes using a variety of algorithms. Each algorithm is run multiple times with different input parameters, as described in more detail in the manuscript. After performing this step, it is necessary to use the SAGA-GIS GUI to extract the TWI values for each of the sensor locations. The output generated in this way is included in this repository as Logger_Locations_TWI_2017-2021.xlsx, so it is not necessary to rerun this step, but the code is provided for completeness.

    Processing TWI and VWC
    • read_process_data.R: takes raw TWI and moisture data files and processes them into analysis-ready format, saving the results as CSV.
    • qc_avg_moisture.R: does additional quality control on the moisture data and averages it across different time periods.

    Model fitting
    Models were fit regressing soil moisture (average VWC for a given time period) against a TWI index, with and without soil depth as a covariate. In each case, for both the model without depth and the model with depth, prediction performance was calculated with and without spatially-blocked cross-validation; where cross-validation wasn't used, we simply used the predictions from the model fit to all the data.
    • fit_combos.R: models fit to each combination of soil moisture averaged over 57 months (all months from April 2017 to December 2021) and 546 TWI indexes. In addition, models were fit to soil moisture averaged over years, and to the grand mean across the full study period.
    • fit_dryperiods.R: models fit to soil moisture averaged over previously identified dry periods within the study period (each 1 or 2 weeks in length), again for each of the 546 indexes.
    • fit_summer.R: models fit to the soil moisture average for June-September of each of the five years, again for each of the 546 indexes.

    Visualizing main results
    Preliminary visualization of results was done in a series of R Markdown notebooks. All the notebooks follow the same general format, plotting model performance (observed-predicted correlation) across different combinations of time period and characteristics of the TWI indexes being compared. The indexes are grouped by SWI versus TWI, DEM filter used, flow algorithm, and any other parameters that varied. The notebooks show the model performance metrics with and without the soil depth covariate, and with and without spatially-blocked cross-validation; crossing those two factors gives four performance values for each combination of time period and TWI index.
    • performance_plots_bymonth.Rmd: using the results from the models fit to each month of data separately, prediction performance averaged by month across the five years of data to show within-year trends.
    • performance_plots_byyear.Rmd: using the results from the models fit to each month of data separately, prediction performance averaged by year to show trends across multiple years.
    • performance_plots_dry_periods.Rmd: prediction performance for the models fit to the previously identified dry periods.
    • performance_plots_summer.Rmd: prediction performance for the models fit to the June-September moisture averages.

    Additional analyses
    Some additional analyses were done that may not be published in the final manuscript but are included here for completeness.
    • 2019dryperiod.Rmd: day-by-day analysis of a specific dry period in 2019.
    • alldryperiodsbyday.Rmd: day-by-day analysis of the same dry periods discussed above.
    • best_indices.R: used after model fitting to quickly identify some of the best-performing indexes for closer scrutiny.
    • wateryearfigs.R: exploratory figures showing the median and quantile interval of VWC for sensors in low and high TWI locations for each water year.

    Resources in this dataset:
    • Resource Title: Digital elevation model of study region. File Name: SourceDEM.zip. Resource Description: .zip archive containing digital elevation model files for the study region. See dataset description for more details.
    • Resource Title: twi-moisture-0.1: archived git repository containing all other necessary data and code. File Name: twi-moisture-0.1.zip. Resource Description: .zip archive containing all data and code other than the digital elevation model, which is archived as a separate file. This file was generated by a GitHub release made on 2022-10-11 of the git repository hosted at https://github.com/qdread/twi-moisture (private repository). See the dataset description and the README file within this archive for more details.
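    The spatially-blocked cross-validation used in the model fitting step can be sketched as follows. This is a minimal Python illustration of the idea only: the actual analysis is done by the R scripts listed above, and the grid-based block assignment here is an assumption for the sketch.

    ```python
    def grid_block_ids(locations, nx=2, ny=2):
        """Assign each (x, y) sensor location to one of nx*ny grid cells.
        Each cell becomes one cross-validation block."""
        xs = [p[0] for p in locations]
        ys = [p[1] for p in locations]

        def cell(v, lo, hi, n):
            if hi == lo:
                return 0
            return min(int((v - lo) / (hi - lo) * n), n - 1)

        return [cell(x, min(xs), max(xs), nx) * ny + cell(y, min(ys), max(ys), ny)
                for x, y in locations]

    def leave_one_block_out(block_ids):
        """Yield (train_indices, test_indices), holding out one spatial block at a
        time, so performance is assessed on sensors from a withheld region."""
        for b in sorted(set(block_ids)):
            test = [i for i, g in enumerate(block_ids) if g == b]
            train = [i for i, g in enumerate(block_ids) if g != b]
            yield train, test

    # Four hypothetical sensor locations, one per quadrant -> four folds.
    blocks = grid_block_ids([(0, 0), (0, 10), (10, 0), (10, 10)])
    assert sorted(blocks) == [0, 1, 2, 3]
    ```

    Holding out whole spatial blocks, rather than random sensors, keeps spatially autocorrelated neighbours out of the training set and gives a less optimistic performance estimate.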

  12. data and code

    • figshare.com
    txt
    Updated Mar 23, 2017
    Cite
    Maria Voukelatou (2017). data and code [Dataset]. http://doi.org/10.6084/m9.figshare.4780312.v2
    Available download formats: txt
    Dataset updated
    Mar 23, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Maria Voukelatou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data for use with the accompanying R script.

  13. kl3m-data-ecfr

    • huggingface.co
    Updated Apr 11, 2025
    + more versions
    Cite
    ALEA Institute (2025). kl3m-data-ecfr [Dataset]. https://huggingface.co/datasets/alea-institute/kl3m-data-ecfr
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Apr 11, 2025
    Authors
    ALEA Institute
    Description

    KL3M Data Project

    Note: This page provides general information about the KL3M Data Project. Additional details specific to this dataset will be added in future updates. For complete information, please visit the GitHub repository or refer to the KL3M Data Project paper.

      Description
    

    This dataset is part of the ALEA Institute's KL3M Data Project, which provides copyright-clean training resources for large language models.

      Dataset Details
    

    Format: Parquet… See the full description on the dataset page: https://huggingface.co/datasets/alea-institute/kl3m-data-ecfr.

  14. Curated Email-Based Code Reviews Datasets

    • figshare.com
    bin
    Updated Feb 7, 2024
    Cite
    Mingzhao Liang; Ping Charoenwet; Patanamon Thongtanunam (2024). Curated Email-Based Code Reviews Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.24679656.v1
    Available download formats: bin
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mingzhao Liang; Ping Charoenwet; Patanamon Thongtanunam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code review (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the unstructured nature of email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping each submitted patch with its revised versions and by grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practice of large OSS projects without additional effort in data collection and curation.
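    The email-alias grouping step can be sketched with a small union-find structure. This is a minimal Python illustration under assumed alias evidence; the actual curation is done by the Patchwork tool suite described above.

    ```python
    class AliasGroups:
        """Union-find over email addresses: union two addresses whenever there is
        evidence they belong to the same individual, then read off the groups."""

        def __init__(self):
            self.parent = {}

        def find(self, addr):
            self.parent.setdefault(addr, addr)
            while self.parent[addr] != addr:
                self.parent[addr] = self.parent[self.parent[addr]]  # path halving
                addr = self.parent[addr]
            return addr

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

        def groups(self):
            out = {}
            for addr in list(self.parent):
                out.setdefault(self.find(addr), []).append(addr)
            return list(out.values())

    # Hypothetical evidence, e.g. the same display name on two addresses.
    g = AliasGroups()
    g.union("alice@kernel.org", "alice@gmail.com")
    g.union("bob@ozlabs.org", "bob@example.com")
    assert len(g.groups()) == 2  # four addresses, two individuals
    ```

    Union-find handles transitive evidence cleanly: if address A is linked to B and B to C, all three end up in one group without any extra logic.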

  15. finqa-data-processed

    • huggingface.co
    Updated Dec 31, 2004
    Close
    Cite
    Weights and Biases (2004). finqa-data-processed [Dataset]. https://huggingface.co/datasets/wandb/finqa-data-processed
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2004
    Dataset authored and provided by
    Weights and Biases
    Description

    FinQA Dataset (Processed)

      Dataset Description

      Dataset Summary
    

    The FinQA dataset is designed for numerical reasoning over financial data, containing questions that require complex reasoning over tables and text from financial reports.

      Dataset Statistics
    

    • Total examples: 8281
    • Training set size: 6624
    • Test set size: 1657
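    The stated counts are internally consistent, as a quick arithmetic check confirms (nothing here is dataset-specific beyond the three numbers above):

    ```python
    total, train, test = 8281, 6624, 1657
    assert train + test == total           # the two splits partition the dataset
    assert round(test / total, 2) == 0.20  # roughly an 80/20 train/test split
    ```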

      Dataset Structure
    

    Each example contains:

    Required columns: query: The question to be answered (derived… See the full description on the dataset page: https://huggingface.co/datasets/wandb/finqa-data-processed.

  16. Replication data and code for analyses in R presented in: Volcanic climate...

    • dataverse.ucla.edu
    bin, html, tsv, txt
    Updated Feb 8, 2022
    Cite
    R.J. Sinensky; R.J. Sinensky (2022). Replication data and code for analyses in R presented in: Volcanic climate forcing, extreme cold and the Neolithic Transition in the northern US Southwest [Dataset]. http://doi.org/10.25346/S6/N3RVLC
    Available download formats: tsv(92491), html(6992077), txt(42582), tsv(25713), tsv(44603), bin(28673), tsv(77600), tsv(675537), txt(3689), tsv(431249)
    Dataset updated
    Feb 8, 2022
    Dataset provided by
    UCLA Dataverse
    Authors
    R.J. Sinensky; R.J. Sinensky
    License

    https://dataverse.ucla.edu/api/datasets/:persistentId/versions/4.3/customlicense?persistentId=doi:10.25346/S6/N3RVLC

    Area covered
    Southwestern United States, United States
    Description

    Online Supplemental Material 2 (OSM 2) contains the data and code necessary to generate Figures 3-6, 8-9, S1 and S5-S6 presented in Sinensky et al. (2022). The R Markdown document (OSM 2.0) will render these figures using the data provided in OSM 2.1-2.6.

  17. Build Outcomes and Code Review Data

    • zenodo.org
    zip
    Updated Jun 20, 2023
    Cite
    Khaled Al-Sabbagh; Khaled Al-Sabbagh (2023). Build Outcomes and Code Review Data [Dataset]. http://doi.org/10.5281/zenodo.8023970
    Available download formats: zip
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Khaled Al-Sabbagh; Khaled Al-Sabbagh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a collection of datasets from Java-based open-source projects. It encompasses two primary datasets:

    1. Build Outcomes and Code Changes Dataset
    • This dataset provides histories of build outcomes associated with code changes from 117 Java-based open-source projects.
    2. Code Review Comments and Code Changes Dataset
    • This dataset provides histories of code review comments, translated into requests for changes and approvals for integration, as well as code changes from two open-source projects.

    Both datasets contain data both before and after applying noise handling techniques.

  18. Data from: Code and data

    • figshare.com
    zip
    Updated Feb 15, 2025
    Cite
    Ritu Lahkar (2025). Code and data [Dataset]. http://doi.org/10.6084/m9.figshare.28423754.v1
    Available download formats: zip
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ritu Lahkar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code to test the data in the article

  19. kl3m-data-govinfo-cri

    • huggingface.co
    Updated Apr 11, 2025
    + more versions
    Cite
    ALEA Institute (2025). kl3m-data-govinfo-cri [Dataset]. https://huggingface.co/datasets/alea-institute/kl3m-data-govinfo-cri
    Dataset updated
    Apr 11, 2025
    Authors
    ALEA Institute
    Description

    KL3M Data Project

    Note: This page provides general information about the KL3M Data Project. Additional details specific to this dataset will be added in future updates. For complete information, please visit the GitHub repository or refer to the KL3M Data Project paper.

      Description
    

    This dataset is part of the ALEA Institute's KL3M Data Project, which provides copyright-clean training resources for large language models.

      Dataset Details
    

    Format: Parquet… See the full description on the dataset page: https://huggingface.co/datasets/alea-institute/kl3m-data-govinfo-cri.

  20. Data and code for training and evaluating machine learning models for...

    • explore.openaire.eu
    Updated Dec 1, 2018
    + more versions
    Cite
    Peter Ukkonen; Antti Mäkelä (2018). Data and code for training and evaluating machine learning models for thunderstorm prediction from reanalysis data [Dataset]. http://doi.org/10.5281/zenodo.1480543
    Dataset updated
    Dec 1, 2018
    Authors
    Peter Ukkonen; Antti Mäkelä
    Description

    FIXED. Data and Python code for training and evaluating machine learning models for predicting thunderstorms, associated with the paper "Evaluation of machine learning classifiers for predicting deep convection" by Peter Ukkonen and Antti Mäkelä (to appear in JAMES). The data (preprocessed inputs and outputs) are stored as netCDF and .mat files, which can be loaded with Python.
