100+ datasets found
  1. CoDEx Large Dataset

    • paperswithcode.com
    + more versions
    Cite
    Tara Safavi; Danai Koutra, CoDEx Large Dataset [Dataset]. https://paperswithcode.com/dataset/codex-large
    Explore at:
    Authors
    Tara Safavi; Danai Koutra
    Description

    CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.
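    If helpful, here is a minimal, hedged sketch of loading CoDEx triples for a knowledge graph completion experiment; the tab-separated head/relation/tail layout follows the files distributed in the CoDEx GitHub repository, and the local path below is purely illustrative.

    # Minimal sketch: read tab-separated (head, relation, tail) triples.
    # The file path is a hypothetical local copy of the CoDEx-L split files.
    def load_triples(path):
        triples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                head, relation, tail = line.rstrip("\n").split("\t")
                triples.append((head, relation, tail))
        return triples

    train = load_triples("codex-l/train.txt")  # hypothetical path
    print(len(train), train[0])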

  2. mm2_level

    • hf-proxy-cf.effarig.site
    • huggingface.co
    Updated Feb 15, 2022
    + more versions
    Cite
    Addison (2022). mm2_level [Dataset]. https://hf-proxy-cf.effarig.site/datasets/TheGreatRambler/mm2_level
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 15, 2022
    Authors
    Addison
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Mario Maker 2 levels

    Part of the Mario Maker 2 Dataset Collection

      Dataset Description
    

    The Mario Maker 2 levels dataset consists of 26.6 million levels from Nintendo's online service, totaling around 100GB of data. The dataset was created using the self-hosted Mario Maker 2 API over the course of 1 month in February 2022.

      How to use it
    

    The Mario Maker 2 levels dataset is a very large dataset, so for most use cases it is recommended to make use of the… See the full description on the dataset page: https://huggingface.co/datasets/TheGreatRambler/mm2_level.
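    Since the full description is truncated above, a hedged sketch of one common way to work with a dataset of this size is to stream it with the Hugging Face datasets library rather than downloading all ~100 GB at once; the split name "train" and the record schema are assumptions, not taken from this listing.

    from datasets import load_dataset

    # Stream records instead of materializing the full ~100 GB dataset locally.
    levels = load_dataset("TheGreatRambler/mm2_level", split="train", streaming=True)

    for i, level in zip(range(3), levels):
        # Field names depend on the dataset schema; inspecting the keys first is safest.
        print(level.keys())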

  3. Large Car-following Dataset Based on Lyft level-5: Following Autonomous...

    • paperswithcode.com
    • 4tu.edu.hpc.n-helix.com
    • +1more
    Updated May 29, 2023
    Cite
    Guopeng Li; Yiru Jiao; Victor L. Knoop; Simeon C. Calvert; J. W. C. van Lint (2023). Large Car-following Dataset Based on Lyft level-5: Following Autonomous Vehicles vs. Human-driven Vehicles Dataset [Dataset]. https://paperswithcode.com/dataset/large-car-following-dataset-based-on-lyft
    Explore at:
    Dataset updated
    May 29, 2023
    Authors
    Guopeng Li; Yiru Jiao; Victor L. Knoop; Simeon C. Calvert; J. W. C. van Lint
    Description

    Studying how human drivers react differently when following autonomous vehicles (AV) vs. human-driven vehicles (HV) is critical for mixed traffic flow. This dataset contains extracted and enhanced two categories of car-following data, HV-following-AV (H-A) and HV-following-HV (H-H), from the open Lyft level-5 dataset.

  4. Spatial data: A large-scale database of modeled contemporary and future...

    • gimi9.com
    • data.usgs.gov
    • +5more
    + more versions
    Cite
    Spatial data: A large-scale database of modeled contemporary and future water temperature data for 10,774 Michigan, Minnesota and Wisconsin Lakes [Dataset]. https://www.gimi9.com/dataset/data-gov_spatial-data-a-large-scale-database-of-modeled-contemporary-and-future-water-temperature-d/
    Explore at:
    Area covered
    Minnesota, Wisconsin
    Description

    Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: site_id, Prmnn_I, GNIS_ID, GNIS_Nm, ReachCd, FType, FCode, which are defined below.

  5. MMLU Dataset

    • paperswithcode.com
    Updated Jan 5, 2025
    + more versions
    Cite
    MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
    Explore at:
    Dataset updated
    Jan 5, 2025
    Authors
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
    Description

    MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.
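    As a hedged illustration of the zero-/few-shot setup described above, the sketch below loads MMLU from the Hugging Face Hub and formats one multiple-choice question; the hub id "cais/mmlu", the config name, and the field names are assumptions, not taken from this listing.

    from datasets import load_dataset

    mmlu = load_dataset("cais/mmlu", "all")   # assumed hub id and config name
    dev = mmlu["dev"]                          # few-shot exemplars
    test = mmlu["test"]                        # questions to score

    def format_question(ex):
        # Assumed schema: a question, four choices, and an integer answer index.
        choices = "\n".join(f"({c}) {t}" for c, t in zip("ABCD", ex["choices"]))
        return f"{ex['question']}\n{choices}\nAnswer:"

    print(format_question(test[0]))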

  6. Data from: CESM1 CAM5 Decadal Prediction Large Ensemble

    • rda.ucar.edu
    • oidc.rda.ucar.edu
    • +1more
    + more versions
    Cite
    CESM1 CAM5 Decadal Prediction Large Ensemble [Dataset]. https://rda.ucar.edu/lookfordata/datasets/?nb=y&b=topic&v=Atmosphere
    Explore at:
    Description

    The CESM Decadal Prediction Large Ensemble (DPLE) is a set of simulations carried out at NCAR to support research into near-term Earth System prediction. The DPLE comprises 62 distinct ensembles, one for each of 62 initialization times (November 1 of 1954, 1955, ..., 2014, 2015). For each start date, a 40-member ensemble was generated by randomly perturbing the atmospheric initial condition at the round-off level. The simulations were integrated forward for 122 months after initialization. Observation-based ocean and sea ice initial conditions for the 1954-2015 period were obtained from a reanalysis-forced simulation of the CESM ocean and sea ice models. The initial conditions for the atmosphere and land models were obtained from CESM Large Ensemble (LENS) simulations at corresponding historical times. Full field initialization was used for all component models, and so drift adjustment prior to analysis is generally recommended.
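    Because full-field initialization was used, a lead-dependent drift adjustment is typically applied before analysis. Below is a minimal numpy sketch of that idea, assuming a hindcast array of shape (start_date, member, lead_month); the array names and shapes are illustrative and not the DPLE file layout.

    import numpy as np

    # Illustrative hindcast array: 62 start dates, 40 members, 122 lead months.
    hindcast = np.random.rand(62, 40, 122)

    # Mean drift as a function of lead time, averaged over starts and members.
    drift = hindcast.mean(axis=(0, 1))            # shape (122,)

    # Subtract the lead-dependent drift before comparing against observations.
    drift_adjusted = hindcast - drift[None, None, :]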

  7. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jun 4, 2024
    + more versions
    Cite
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. http://doi.org/10.5281/zenodo.11391315
    Explore at:
    application/gzip, bin, txt (available download formats)
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lele Cao; Vilhelm von Ehrenheim; Mark Granroth-Wilding; Richard Anselmo Stahl; Drew McCornack; Armin Catovic; Dhiana Deva Cavacanti Rocha
    Time period covered
    May 29, 2024
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    • Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.
    • Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The CR set contains 76 distinct target companies, each of which has 5.3 competitors annotated on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for similar companies to A.
    • Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.
    • Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Companies that are similar to the seed companies can then be searched for in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.
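    A minimal sketch of the embedding-space search just described: given pre-computed description embeddings (for example, the mSBERT vectors shipped with CompanyKG), rank candidate companies by cosine similarity to a seed company. The array sizes and index values below are illustrative only.

    import numpy as np

    embeddings = np.random.rand(10_000, 768)      # one vector per company (illustrative)
    seed = embeddings[42]                          # a seed company of interest

    # Cosine similarity of every company to the seed, then take the top 10.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(seed)
    cosine = embeddings @ seed / norms
    top10 = np.argsort(-cosine)[:10]
    print(top10)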

    However, a graph is still the most natural choice for representing and learning diverse company relations, due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.

    Source Code and Tutorial:
    https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  8. COVID-19 County Level Data - Archive

    • catalog.data.gov
    • data.ct.gov
    • +2more
    Updated Aug 12, 2023
    Cite
    data.ct.gov (2023). COVID-19 County Level Data - Archive [Dataset]. https://catalog.data.gov/dataset/covid-19-county-level-data
    Explore at:
    Dataset updated
    Aug 12, 2023
    Dataset provided by
    data.ct.gov
    Description

    COVID-19 daily metrics at the county level. As of 6/1/2023, this data set is no longer being updated.

    The COVID-19 Data Report is posted on the Open Data Portal every day at 3pm. The report uses data from multiple sources, including external partners; if data from external partners are not received by 3pm, they are not available for inclusion in the report and will not be displayed. Data that are received after 3pm will still be incorporated and published in the next report update.

    The cumulative number of COVID-19 cases (cumulative_cases) includes all cases of COVID-19 that have ever been reported to DPH. The number of COVID-19 cases in the last 7 days (cases_7days) only includes cases where the specimen collection date is within the past 7 days. While most cases are reported to DPH within 48 hours of specimen collection, there are a small number of cases that routinely are delayed and will have specimen collection dates that fall outside of the rolling 7-day reporting window. Additionally, reporting entities may submit correction files to contribute historic data during initial onboarding or to address data quality issues; while this is rare, these correction files may cause a large amount of data from outside of the current reporting window to be uploaded in a single day, which would result in the change in cumulative_cases being much larger than the value of cases_7days.

    On June 4, 2020, the US Department of Health and Human Services issued guidance requiring the reporting of positive and negative test results for SARS-CoV-2; this guidance expired with the end of the federal PHE on 5/11/2023, and negative SARS-CoV-2 results were removed from the List of Reportable Laboratory Findings. DPH will no longer be reporting metrics that were dependent on the collection of negative test results, specifically total tests performed or percent positivity. Positive antigen and PCR/NAAT results will continue to be reportable.
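    A hedged pandas sketch of the cases_7days logic described above: count cases whose specimen collection date falls inside the trailing 7-day window, alongside the cumulative total. The column name and dates are assumptions for illustration.

    import pandas as pd

    cases = pd.DataFrame({
        "specimen_collection_date": pd.to_datetime(
            ["2023-05-20", "2023-05-27", "2023-05-30", "2023-06-01"]),
    })
    report_date = pd.Timestamp("2023-06-01")
    window_start = report_date - pd.Timedelta(days=7)

    cumulative_cases = len(cases)
    cases_7days = ((cases["specimen_collection_date"] > window_start) &
                   (cases["specimen_collection_date"] <= report_date)).sum()
    print(cumulative_cases, cases_7days)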

  9. High-resolution cone-beam scan of an apple and pebbles with two dosage...

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Sophia Bethany Coban; Marinus J. Lagerwerf; K. Joost Batenburg (2020). High-resolution cone-beam scan of an apple and pebbles with two dosage levels [Dataset]. http://doi.org/10.5281/zenodo.1475213
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sophia Bethany Coban; Marinus J. Lagerwerf; K. Joost Batenburg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We release tomographic scans of two measured objects, each acquired at two levels of radiation dosage, for noise-level comparative studies in data analysis, reconstruction or segmentation methods. The objects are referred to as apple and pebbles (more specifically, hydrograins), respectively. The dataset collected with the higher dosage is referred to as the "good" dataset and the other as the "noisy" dataset, as a way to distinguish between the two dosage levels.

    The datasets were acquired using the custom-built and highly flexible CT scanner FlexRay Lab, developed by XRE NV and located at CWI. This apparatus consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1943-by-1535 pixel, 14-bit, flat detector panel.

    Both datasets were collected over 360 degrees in circular and continuous motion, with 2001 projections distributed evenly over the full circle for the good dataset and 501 projections distributed evenly over the full circle for the noisy dataset. The uploaded datasets are not binned or normalized; a single dark field and two (pre- and post-scan) flat fields are included for each scan. Projections for both sets were collected with 100 ms exposure time, with the good data projections averaged over 5 takes and no averaging for the noisy data. The tube settings for the good and noisy datasets were 70kV, 45W and 70kV, 20W, respectively. The total scanning time was 20 minutes for the good scan and 3 minutes for the noisy scan. Each dataset is packaged with the full list of data and scan settings files (in .txt format). These files contain the tube settings, scan geometry and full list of motor settings.
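    Since the projections are released unnormalized, a typical first step is dark/flat-field correction followed by the negative log transform. Here is a minimal numpy sketch, assuming the dark field and the two flat fields have been loaded as arrays; averaging the pre- and post-scan flats is a choice, not something the dataset prescribes.

    import numpy as np

    def normalize(projection, dark, flat_pre, flat_post):
        flat = 0.5 * (flat_pre + flat_post)              # average the two flat fields
        transmission = (projection - dark) / np.clip(flat - dark, 1e-6, None)
        transmission = np.clip(transmission, 1e-6, 1.0)  # guard against log of <= 0
        return -np.log(transmission)                     # line integrals for reconstruction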

    These datasets were produced by the Computational Imaging members at Centrum Wiskunde & Informatica (CI-CWI). For useful Python/MATLAB scripts for FlexRay datasets, we refer the reader to our group's GitHub page.

    For more information or guidance in using these datasets, please get in touch with

    • s.b.coban [at] cwi.nl or
    • m.j.lagerwerf [at] cwi.nl
  10. ERA5 hourly data on single levels from 1940 to present

    • cds.climate.copernicus.eu
    • arcticdata.io
    grib
    Updated Mar 26, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 hourly data on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.adbb2d47
    Explore at:
    grib (available download formats)
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf

    Time period covered
    Jan 1, 1959 - Mar 20, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; if this occurs, users are notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
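    A hedged sketch of requesting a small subset of this dataset through the Climate Data Store API using the cdsapi package; the variable list, dates and output filename are illustrative, and request keys can differ between CDS API versions.

    import cdsapi

    c = cdsapi.Client()  # requires a CDS account and ~/.cdsapirc credentials
    c.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": ["2m_temperature", "10m_u_component_of_wind"],
            "year": "2020",
            "month": "01",
            "day": "01",
            "time": ["00:00", "12:00"],
            "format": "netcdf",
        },
        "era5_single_levels_subset.nc",
    )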

  11. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table.

    2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples.

    replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below.

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  12. Dataset for Level Plains, AL Census Bureau Income Distribution by Gender

    • neilsberg.com
    Updated Jan 9, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Dataset for Level Plains, AL Census Bureau Income Distribution by Gender [Dataset]. https://www.neilsberg.com/research/datasets/b3be1009-abcb-11ee-8b96-3860777c1fe6/
    Explore at:
    Dataset updated
    Jan 9, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Level Plains, Alabama
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates Level Plains household income by gender. It can be utilized to understand the gender-based income distribution in Level Plains.

    Content

    The dataset will include the following datasets, when applicable:

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Level Plains, AL annual median income by work experience and sex dataset : Aged 15+, 2010-2022 (in 2022 inflation-adjusted dollars)
    • Level Plains, AL annual income distribution by work experience and gender dataset (Number of individuals ages 15+ with income, 2021)

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Level Plains income distribution by gender. You can refer to the same here.

  13. MATH Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2more
    Updated Jan 10, 2025
    + more versions
    Cite
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt (2025). MATH Dataset [Dataset]. https://paperswithcode.com/dataset/math
    Explore at:
    Dataset updated
    Jan 10, 2025
    Authors
    Dan Hendrycks; Collin Burns; Saurav Kadavath; Akul Arora; Steven Basart; Eric Tang; Dawn Song; Jacob Steinhardt
    Description

    MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
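    A hedged sketch of iterating over locally downloaded MATH problems; the original distribution ships one JSON file per problem with "problem" and "solution" fields, but the directory layout below is an assumption for illustration.

    import json
    from pathlib import Path

    # Hypothetical local layout: MATH/train/<subject>/<id>.json
    for path in sorted(Path("MATH/train/algebra").glob("*.json"))[:3]:
        with open(path, encoding="utf-8") as f:
            item = json.load(f)
        print(item["problem"][:80])    # problem statement
        print(item["solution"][:80])   # step-by-step solution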

  14. Comprehensive Median Household Income and Distribution Dataset for Level...

    • neilsberg.com
    Updated Jan 11, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Comprehensive Median Household Income and Distribution Dataset for Level Plains, AL: Analysis by Household Type, Size and Income Brackets [Dataset]. https://www.neilsberg.com/research/datasets/cda82c9e-b041-11ee-aaca-3860777c1fe6/
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Level Plains, Alabama
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the median household income in Level Plains. It can be utilized to understand the trend in median household income and to analyze the income distribution in Level Plains by household type, size, and across various income brackets.

    Content

    The dataset will include the following datasets, when applicable:

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Level Plains, AL Median Household Income Trends (2010-2021, in 2022 inflation-adjusted dollars)
    • Median Household Income Variation by Family Size in Level Plains, AL: Comparative analysis across 7 household sizes
    • Income Distribution by Quintile: Mean Household Income in Level Plains, AL
    • Level Plains, AL households by income brackets: family, non-family, and total, in 2022 inflation-adjusted dollars

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Level Plains median household income. You can refer to the same here.

  15. NOAA Analysis of Record for Calibration (AORC) Dataset

    • registry.opendata.aws
    Updated Mar 22, 2024
    Cite
    NOAA Analysis of Record for Calibration (AORC) Dataset [Dataset]. https://registry.opendata.aws/noaa-nws-aorc/
    Explore at:
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Description

    The Analysis Of Record for Calibration (AORC) is a gridded record of near-surface weather conditions covering the continental United States and Alaska and their hydrologically contributing areas. It is defined on a latitude/longitude spatial grid with a mesh length of 30 arc seconds (~800 m), and a temporal resolution of one hour. Elements include hourly total precipitation, temperature, specific humidity, terrain-level pressure, downward longwave and shortwave radiation, and west-east and south-north wind components. It spans the period from 1979 across the Continental U.S. (CONUS) and from 1981 across Alaska, to the near-present (at all locations). This suite of eight variables is sufficient to drive most land-surface and hydrologic models and is used as input to the National Water Model (NWM) retrospective simulation. While the native AORC process generates netCDF output, the data is post-processed to create a cloud optimized Zarr formatted equivalent for dissemination using cloud technology and infrastructure.

    AORC Version 1.1 dataset creation
    The AORC dataset was created after reviewing, identifying, and processing multiple large-scale, observation, and analysis datasets. There are two versions of The Analysis Of Record for Calibration (AORC) data.

    The initial AORC Version 1.0 dataset was completed in November 2019 and consisted of a grid with 8 elements at a resolution of 30 arc seconds. The AORC version 1.1 dataset was created to address issues (see Table 1 in Fall et al., 2023) in the version 1.0 CONUS dataset. Full documentation on version 1.1 of the AORC data and the related journal publication are provided below.

    The native AORC version 1.1 process creates a dataset that consists of netCDF files with the following dimensions: 1 hour, 4201 latitude values (ranging from 25.0 to 53.0), and 8401 longitude values (ranging from -125.0 to -67).

    The data creation runs with a 10-day lag to ensure the inclusion of any corrections to the input Stage IV and NLDAS data.

    Note - The full extent of the AORC grid as defined in its data files exceeds the extents cited above; those outermost rows and columns of the data grids are filled with missing values and are remnants of an early set of required AORC extents that have since been adjusted inward.

    AORC Version 1.1 Zarr Conversion

    The goal for converting the AORC data from netCDF to Zarr was to allow users to quickly and efficiently load/use the data. For example, one year of data takes 28 mins to load via NetCDF while only taking 3.2 seconds to load via Zarr (resulting in a substantial increase in speed). For longer periods of time, the percentage increase in speed using Zarr (vs NetCDF) is even higher. Using Zarr also leads to less memory and CPU utilization.

    It was determined that the optimal conversion for the data was one year's worth of Zarr files with a chunk size of 18MB. The chunking was completed across all 8 variables. The chunks consist of the following dimensions: 144 time, 128 latitude, and 256 longitude. To create the files in the Zarr format, the netCDF files were rechunked using chunk() in Xarray. After chunking the files, they were converted to a monthly Zarr file. Then, each monthly Zarr file was combined using to_zarr to create a Zarr file that represents a full year.

    Users wanting more than 1 year of data will be able to utilize Zarr utilities/libraries to combine multiple years up to the span of the full data set.
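    A hedged xarray sketch of opening one year of the Zarr store and subsetting it; the S3 bucket/key below is a placeholder (check the registry page for the actual paths), coordinate names are assumed, and anonymous access plus an installed s3fs are also assumptions.

    import xarray as xr

    ds = xr.open_zarr(
        "s3://noaa-aorc-example-bucket/1979.zarr",   # hypothetical path
        storage_options={"anon": True},              # anonymous S3 access via s3fs
    )
    # Coordinate names (latitude/longitude) are assumed; adjust to the actual store.
    precip = ds["APCP_surface"].sel(
        latitude=slice(35, 36), longitude=slice(-100, -99)
    ).load()
    print(precip)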

    There are eight variables representing the meteorological conditions:

    • Total Precipitation (APCP_surface): hourly total precipitation (kg m-2 or mm)
    • Air Temperature (TMP_2maboveground): temperature at 2 m above ground level (AGL) (K)
    • Specific Humidity (SPFH_2maboveground): specific humidity at 2 m AGL (g g-1)
    • Downward Long-Wave Radiation Flux (DLWRF_surface): downward longwave (infrared) radiation flux at the surface (W m-2)
    • Downward Short-Wave Radiation Flux (DSWRF_surface): downward shortwave (solar) radiation flux at the surface (W m-2)
    • Pressure (PRES_surface): air pressure at the surface (Pa)
    • U-Component of Wind (UGRD_10maboveground): U (west-east) component of the wind at 10 m AGL (m s-1)
    • V-Component of Wind (VGRD_10maboveground): V (south-north) component of the wind at 10 m AGL (m s-1)

    Precipitation and Temperature

    The gridded AORC precipitation dataset contains one-hour Accumulated Surface Precipitation (APCP) ending at the “top” of each hour, in liquid water-equivalent units (kg m-2 to the nearest 0.1 kg m-2), while the gridded AORC temperature dataset is comprised of instantaneous, 2 m above-ground-level (AGL) temperatures at the top of each hour (in Kelvin, to the nearest 0.1).

    Specific Humidity, Pressure, Downward Radiation, Wind

    The development process for the six additional dataset components of the Conus AORC [i.e., specific humidity at 2m above ground (kg kg-1); downward longwave and shortwave radiation fluxes at the surface (W m-2); terrain-level pressure (Pa); and west-east and south-north wind components at 10 m above ground (m s-1)] has two distinct periods, based on datasets and methodology applied: 1979–2015 and 2016–present.

  16. High-resolution cone-beam scan of two pomegranates with two dosage levels

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Sophia Bethany Coban; Marinus J. Lagerwerf; K. Joost Batenburg (2020). High-resolution cone-beam scan of two pomegranates with two dosage levels [Dataset]. http://doi.org/10.5281/zenodo.1475263
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sophia Bethany Coban; Marinus J. Lagerwerf; K. Joost Batenburg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We release tomographic scans of two pomegranates, each acquired at two levels of radiation dosage, for noise-level comparative studies in data analysis, reconstruction or segmentation methods. The dataset collected with the higher dosage is referred to as the "good" dataset and the other as the "noisy" dataset, as a way to distinguish between the two dosage levels.

    The datasets were acquired using the custom-built and highly flexible CT scanner FlexRay Lab, developed by XRE NV and located at CWI. This apparatus consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1943-by-1535 pixel, 14-bit, flat detector panel.

    Both datasets were collected over 360 degrees in circular and continuous motion, with 2001 projections distributed evenly over the full circle for the good dataset and 501 projections distributed evenly over the full circle for the noisy dataset. The uploaded datasets are not binned or normalized; a single dark field and two (pre- and post-scan) flat fields are included for each scan. Projections for both sets were collected with 100 ms exposure time, with the good data projections averaged over 5 takes and no averaging for the noisy data. The tube settings for the good and noisy datasets were 70kV, 45W and 70kV, 20W, respectively. The total scanning time was 20 minutes for the good scan and 3 minutes for the noisy scan. Each dataset is packaged with the full list of data and scan settings files (in .txt format). These files contain the tube settings, scan geometry and full list of motor settings.

    These datasets were produced by the Computational Imaging members at Centrum Wiskunde & Informatica (CI-CWI). For useful Python/MATLAB scripts for FlexRay datasets, we refer the reader to our group's GitHub page.

    For more information or guidance in using these datasets, please get in touch with

    • s.b.coban [at] cwi.nl or
    • m.j.lagerwerf [at] cwi.nl
  17. Global Merged Multi-Mission Hourly Gridded Wind Level 4 Dataset (2010-2020)...

    • pigma.org
    • sextant.ifremer.fr
    doi, www:ftp +1
    Updated Apr 5, 2023
    Cite
    CERSAT Exploitation (2023). Global Merged Multi-Mission Hourly Gridded Wind Level 4 Dataset (2010-2020) for ESA MAXSS Project [Dataset]. https://www.pigma.org/geonetwork/srv/api/records/35002607-3546-412b-8c5d-9c182a16ffea
    Explore at:
    www:link, www:ftp, doi (available download formats)
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    CERSAT Exploitation
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2010 - Dec 31, 2020
    Area covered
    Description

    The Level 4 merged microwave wind product is a complete set of hourly global 10-m wind maps on a 0.25x0.25 degree latitude-longitude grid, spanning 1 Jan 2010 through the end of 2020. The product combines background neutral equivalent wind fields from ERA5, daily surface current fields from CMEMS, and stress equivalent winds obtained from several microwave passive and active sensors to produce hourly surface current relative stress equivalent wind analyses. The satellite winds include those from recently launched L-band passive sensors capable of measuring extreme winds in tropical cyclones, with little or no degradation from precipitation. All satellite winds used in the analyses have been recalibrated using a large set of collocated satellite-SFMR wind data in storm-centric coordinates. To maximize the use of the satellite microwave data, winds within a 24-hour window centered on the analysis time have been incorporated into each analysis. To accommodate the large time window, satellite wind speeds are transformed into deviations from ERA5 background wind speeds interpolated to the measurement times, and then an optical flow-based morphing technique is applied to these wind speed increments to propagate them from measurement to analysis time. These morphed wind speed increments are then added to the background wind speed at the analysis time to yield a set of total wind speed fields for each sensor at the analysis time. These individual sensor wind speed fields are then combined with the background 10-m wind direction to yield vorticity and divergence fields for the individual sensor winds. From these, merged vorticity and divergence fields are computed as a weighted average of the individual vorticity and divergence fields. The final vector wind field is then obtained directly from these merged vorticity and divergence fields. Note that one consequence of producing the analyses in terms of vorticity and divergence is that there are no discontinuities in the wind speed fields at the (morphed) swath edges.

    There are two important points to be noted. First, the background ERA5 wind speed fields have been rescaled to be globally consistent with the recalibrated AMSR2 wind speeds. This rescaling involves a large increase in the ERA5 background winds beyond about 17 m/s. For example, an ERA5 10 m wind speed of 30 m/s is transformed into a wind speed of 41 m/s, and a wind speed of 34 m/s is transformed into a wind speed of about 48 m/s. Second, the current version of the product is calibrated for use within tropical cyclones and is not appropriate for use elsewhere.

    This dataset was produced in the frame of the ESA MAXSS project. The primary objective of the ESA Marine Atmosphere eXtreme Satellite Synergy (MAXSS) project is to provide guidance and innovative methodologies to maximize the synergetic use of available Earth Observation data (satellite, in situ) to improve understanding about the multi-scale dynamical characteristics of extreme air-sea interaction.
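    As a rough illustration of the vorticity/divergence step described above, the numpy sketch below computes both fields from u/v wind components on a regular latitude-longitude grid with centred differences; this is illustrative only and not the MAXSS processing code.

    import numpy as np

    def vorticity_divergence(u, v, lat, lon):
        """u, v: 2-D (lat, lon) wind components in m/s; lat, lon in degrees."""
        R = 6.371e6                                   # Earth radius (m)
        lat_r, lon_r = np.deg2rad(lat), np.deg2rad(lon)
        coslat = np.cos(lat_r)[:, None]

        du_dx = np.gradient(u, lon_r, axis=1) / (R * coslat)
        dv_dx = np.gradient(v, lon_r, axis=1) / (R * coslat)
        du_dy = np.gradient(u, lat_r, axis=0) / R
        dv_dy = np.gradient(v, lat_r, axis=0) / R

        return dv_dx - du_dy, du_dx + dv_dy           # vorticity, divergence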

  18. PARIS Dataset Dataset

    • paperswithcode.com
    Updated May 11, 2024
    Cite
    Jiayi Liu; Ali Mahdavi-Amiri; Manolis Savva (2024). PARIS Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/paris-dataset
    Explore at:
    Dataset updated
    May 11, 2024
    Authors
    Jiayi Liu; Ali Mahdavi-Amiri; Manolis Savva
    Description

    From PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects, Section 5.1 (Dataset):

    Synthetic dataset. The synthetic 3D models we use for evaluation are from the PartNet-Mobility dataset [49, 27, 4], a large-scale dataset for articulated objects across 46 categories. We select instances across 10 categories to conduct our experiments. For each articulation state, we randomly sample 64-100 views covering the upper hemisphere of the object to simulate capturing in the real world. Then we render RGB images and acquire camera parameters and object masks using Blender [6] to create our training data.

    Real-world dataset. The real data we use for experiments is from the MultiScan dataset [25], scanning real-world indoor scenes with articulated objects in multiple states. We use the reconstructed mesh of an object in two states as ground truth for evaluation, and the real RGB frames as training data.

    From the project GitHub: We release both synthetic and real data shown in the paper here. Once downloaded, folders data and load should be put directly under the project directory. If you find it slow to download the data from our server, please try this OneDrive link instead.

  19. A collocated unstructured finite volume Level Set / Front Tracking method for...

    • b2find.dkrz.de
    Updated Oct 21, 2023
    + more versions
    Cite
    (2023). A collocated unstructured finite volume Level Set / Front Tracking method for two-phase flows with large density-ratios - data - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/7e72e353-9303-5d55-a779-bb78bc1eb14e
    Explore at:
    Dataset updated
    Oct 21, 2023
    Description

    A collocated unstructured finite volume Level Set / Front Tracking method for two-phase flows with large density-ratios - data - Dataset - B2FIND

  20. Data from: Comparison of CAMS and CMAQ analyses of surface-level PM2.5 and...

    • rda.ucar.edu
    • data.ucar.edu
    • +1more
    Cite
    Comparison of CAMS and CMAQ analyses of surface-level PM2.5 and O3 over the conterminous United States (CONUS) [Dataset]. https://rda.ucar.edu/lookfordata/datasets/?nb=y&b=topic&v=Atmosphere
    Explore at:
    Description

    In this study, we compare the performance of the analysis time series over the period of August 2020 to December 2021 at EPA AirNow stations for both PM2.5 and O3 ... from raw Copernicus Atmosphere Monitoring Service (CAMS) reanalyses (CAMS RA Raw), raw CAMS near real-time forecasts (CAMS FC Raw), raw near real-time Community Multi-scale Air Quality (CMAQ) forecasts (CMAQ FC Raw), bias-corrected CAMS forecasts (CAMS FC BC), and bias-corrected CMAQ forecasts (CMAQ FC BC). This 17-month period spans two wildfire seasons, to assess model analysis performance in high-end AQ events. In addition to determining the best-performing gridded product, this process allows us to benchmark the performance of CMAQ forecasts against other global datasets (CAMS reanalysis and forecasts). For both PM2.5 and O3, the bias correction algorithm employed here greatly improved upon the raw model time series, and CMAQ FC BC was the best-performing model analysis time series, having the lowest RMSE, smallest bias error, and largest critical success index at multiple thresholds.
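    A minimal numpy sketch of the verification metrics named above (RMSE, bias error, and critical success index at a threshold), assuming paired model/observation arrays; it is illustrative and not the evaluation code used in the study.

    import numpy as np

    def verify(model, obs, threshold):
        rmse = np.sqrt(np.mean((model - obs) ** 2))
        bias = np.mean(model - obs)                    # mean bias error
        hits = np.sum((model >= threshold) & (obs >= threshold))
        misses = np.sum((model < threshold) & (obs >= threshold))
        false_alarms = np.sum((model >= threshold) & (obs < threshold))
        denom = hits + misses + false_alarms
        csi = hits / denom if denom else np.nan        # critical success index
        return rmse, bias, csi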
