100+ datasets found
  1. Functional Use Database (FUse)

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Functional Use Database (FUse) [Dataset]. https://catalog.data.gov/dataset/functional-use-database-fuse
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    There are five different files for this dataset:

    1. A dataset listing the reported functional uses of chemicals (FUse)
    2. All 729 ToxPrint descriptors obtained from ChemoTyper for chemicals in FUse
    3. All EPI Suite properties obtained for chemicals in FUse
    4. The confusion matrix values, similarity thresholds, and bioactivity index for each model
    5. The functional use prediction, bioactivity index, and prediction classification (poor prediction, functional substitute, candidate alternative) for each Tox21 chemical

    This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. GREEN CHEMISTRY. Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074, (2017).

  2. Data from: Agri-Food System Water Use Database

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jun 4, 2024
    Cite
    International Food Policy Research Institute (IFPRI) (2024). Agri-Food System Water Use Database [Dataset]. http://doi.org/10.7910/DVN/FZK8WE
    Available download formats: Croissant
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    International Food Policy Research Institute (IFPRI)
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FZK8WE

    Time period covered
    2022
    Description

    This database provides information about the amount of water used in agri-food systems, covering all sectors from farming to food processing industries. The data are presented at the country level with sectoral disaggregation following the Nexus Social Accounting Matrix (SAM) sectoral specifications. The database also differentiates the type of water in each sector based on water sources. Green water refers to water originating from precipitation or rain, while blue water refers to all water that comes from irrigation, covering both surface water and groundwater. Both types of water are consumed by plants or animals during the production process. Grey water, on the other hand, is the amount of water generated by production activities that pollute water. Since it carries loads of pollutants created by production activities, this type of water can be seen as waste in the whole production system.

  3. O*NET Database

    • onetcenter.org
    excel, mysql, oracle +2
    Cite
    National Center for O*NET Development, O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Available download formats: oracle, sql server, text, mysql, excel
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    US Department of Labor, Employment and Training Administration
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)

  4. Dofus Database

    • kaggle.com
    zip
    Updated Aug 6, 2022
    Cite
    PostMortem (2022). Dofus Database [Dataset]. https://www.kaggle.com/datasets/pstmrtem/dofus-dabase
    Available download formats: zip (1,552,056 bytes)
    Dataset updated
    Aug 6, 2022
    Authors
    PostMortem
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dofus Database

    How it has been gathered

    This dataset was produced by scraping the encyclopedia on Dofus' website.

    It was first done as a challenge and as a way to gather some textual data about the game. You can find the code used to scrape and parse the data here: https://github.com/Futurne/dofus_scrap.

    Quick presentation

    The main files are the JSON files, where you will find all scraped items of the game from the encyclopedia. Those files are named after their categories in the encyclopedia (e.g. you will find all weapons in the armes.json file). You can explore those files; they are pretty self-explanatory.

    Another dataset here is almanax.csv, which is simply a dataset of every almanax description, scraped over a whole year. For each day, you will find the boss, rubrikabrax, and meryde descriptions.

    How to use it

    You can use this dataset to fine-tune a pretrained French model on all the textual information (in almanax.csv and in the JSON files). The items have a description property that can be gathered into a big NLP dataset.

    You can also do some data analysis: can you find which items are overpowered? Are they harder to craft?

    Finally, one other idea would be to build an automatic "stuff" (gear set) optimizer: you could ask for the best Dofus gear set, one that maximizes one element while satisfying some constraints.

    Note that I eventually found that there is already an unofficial API (just like my dataset is unofficial) allowing one to get all the data. You can check out their project here: https://dofapi.fr/. They do not seem to have updated it in a long time, though.

  5. DataSheet2_Data Sources for Drug Utilization Research in Brazil—DUR-BRA Study.xlsx

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    + more versions
    Cite
    Lisiane Freitas Leal; Claudia Garcia Serpa Osorio-de-Castro; Luiz Júpiter Carneiro de Souza; Felipe Ferre; Daniel Marques Mota; Marcia Ito; Monique Elseviers; Elisangela da Costa Lima; Ivan Ricardo Zimmernan; Izabela Fulone; Monica Da Luz Carvalho-Soares; Luciane Cruz Lopes (2023). DataSheet2_Data Sources for Drug Utilization Research in Brazil—DUR-BRA Study.xlsx [Dataset]. http://doi.org/10.3389/fphar.2021.789872.s002
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Lisiane Freitas Leal; Claudia Garcia Serpa Osorio-de-Castro; Luiz Júpiter Carneiro de Souza; Felipe Ferre; Daniel Marques Mota; Marcia Ito; Monique Elseviers; Elisangela da Costa Lima; Ivan Ricardo Zimmernan; Izabela Fulone; Monica Da Luz Carvalho-Soares; Luciane Cruz Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brazil
    Description

    Background: In Brazil, studies that map electronic healthcare databases in order to assess their suitability for use in pharmacoepidemiologic research are lacking. We aimed to identify, catalogue, and characterize Brazilian data sources for Drug Utilization Research (DUR).

    Methods: The present study is part of the project entitled, “Publicly Available Data Sources for Drug Utilization Research in Latin American (LatAm) Countries.” A network of Brazilian health experts was assembled to map secondary administrative data from healthcare organizations that might provide information related to medication use. A multi-phase approach including internet search of institutional government websites, traditional bibliographic databases, and experts’ input was used for mapping the data sources. The reviewers searched, screened and selected the data sources independently; disagreements were resolved by consensus. Data sources were grouped into the following categories: 1) automated databases; 2) Electronic Medical Records (EMR); 3) national surveys or datasets; 4) adverse event reporting systems; and 5) others. Each data source was characterized by accessibility, geographic granularity, setting, type of data (aggregate or individual-level), and years of coverage. We also searched for publications related to each data source.

    Results: A total of 62 data sources were identified and screened; 38 met the eligibility criteria for inclusion and were fully characterized. We grouped 23 (60%) as automated databases, four (11%) as adverse event reporting systems, four (11%) as EMRs, three (8%) as national surveys or datasets, and four (11%) as other types. Eighteen (47%) were classified as publicly and conveniently accessible online, providing information at the national level. Most of them offered more than 5 years of comprehensive data coverage and presented data at both the individual and aggregated levels. No information about population coverage was found. Drug coding is not uniform; each data source has its own coding system, depending on the purpose of the data. At least one scientific publication was found for each publicly available data source.

    Conclusions: There are several types of data sources for DUR in Brazil, but a uniform system for drug classification and data quality evaluation does not exist. The extent of population covered by year is unknown. Our comprehensive and structured inventory reveals a need for full characterization of these data sources.

  6. HCUP California

    • stanford.redivis.com
    • redivis.com
    application/jsonl +7
    Updated May 20, 2020
    Cite
    Stanford Center for Population Health Sciences (2020). HCUP California [Dataset]. http://doi.org/10.57761/krfh-m184
    Available download formats: application/jsonl, arrow, parquet, sas, avro, spss, csv, stata
    Dataset updated
    May 20, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 2008 - Dec 31, 2011
    Area covered
    California
    Description

    Abstract

    The State Ambulatory Surgery Databases (SASD), State Inpatient Databases (SID), and State Emergency Department Databases (SEDD) are part of a family of databases and software tools developed for the Healthcare Cost and Utilization Project (HCUP).

    HCUP's state-specific databases can be used to investigate state-specific and multi-state trends in health care utilization, access, charges, quality, and outcomes. PHS has several years (2008-2011) and datasets (SASD, SEDD, and SID) for HCUP California available.

    Usage

    The State Ambulatory Surgery and Services Databases (SASD) are State-specific files that include data for ambulatory surgery and other outpatient services from hospital-owned facilities. In addition, some States provide data on ambulatory surgery and outpatient services from nonhospital-owned facilities. The uniform format of the SASD helps facilitate cross-State comparisons. The SASD are well suited for research that requires complete enumeration of hospital-based ambulatory surgeries within geographic areas or States.

    The State Inpatient Databases (SID) are State-specific files that contain all inpatient care records in participating states. Together, the SID encompass more than 95 percent of all U.S. hospital discharges. The uniform format of the SID helps facilitate cross-state comparisons. In addition, the SID are well suited for research that requires complete enumeration of hospitals and discharges within geographic areas or states.

    The State Emergency Department Databases (SEDD) are a set of longitudinal State-specific emergency department (ED) databases included in the HCUP family. The SEDD capture discharge information on all emergency department visits that do not result in an admission. Information on patients seen in the emergency room and then admitted to the hospital is included in the State Inpatient Databases (SID).

    SASD, SID, and SEDD each have documentation which includes:

    • Description of the Database
    • Restrictions on Use
    • File Specifications and Load Program
    • Data Elements
    • Additional Resources for Data Elements
    • ICD-10-CM/PCS Data Included in the Dataset Starting with 2015
    • Known Data Issues
    • HCUP Tools: Labels and Formats
    • HCUP Supplemental Files
    • Obtaining HCUP Data


    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission.

    We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit:

    https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    Documentation

    The HCUP California inpatient files were constructed from the confidential files received from the Office of Statewide Health Planning and Development (OSHPD). OSHPD excluded inpatient stays that, after processing by OSHPD, did not contain a complete and “in-range” admission date or discharge date. California also excluded inpatient stays that had an unknown or missing date of birth. OSHPD removes ICD-9-CM and ICD-10-CM diagnosis codes for HIV test results. Beginning with 2009 data, OSHPD changed regulations to require hospitals to report all external cause of injury diagnosis codes, including those specific to medical misadventures. Prior to 2009, OSHPD did not require collection of diagnosis codes identifying medical misadventures.

    Types of Facilities Included in the Files Provided to HCUP by the Partner

    California supplied discharge data for inpatient stays in general acute care hospitals, acute psychiatric hospitals, chemical dependency recovery hospitals, psychiatric health facilities, and state operated hospitals. A comparison of the number of hospitals included in the SID and the number of hospitals reported in the AHA Annual Survey is available starting in data year 2010. Hospitals do not always report data for a full calendar year. Some hospitals open or close during the year; other hospitals have technical problems that prevent them from reporting data for all months in a year.

    Inclusion of Stays in Special Units

    Included with the general acute care stays are stays in skilled nursing, intermediate care, rehabilitation, alcohol/chemical dependency treatment, and psychiatric units of hospitals in California. How the stays in these different types of units can be identified differs by data year. Beginning in 2006, the information is retained in the HCUP variable HOSPITALUNIT. Reliability of this indicator for the level of care depends on how it was assigned by the hospital. For data years 1998-2006, the information was retained in the HCUP variable LEVELCARE. Prior to 1998, the first

  7. Time Series International Database: International Populations by Single Year of Age and Sex

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 30, 2025
    + more versions
    Cite
    U.S. Census Bureau (2025). Time Series International Database: International Populations by Single Year of Age and Sex [Dataset]. https://catalog.data.gov/dataset/international-data-base-time-series-international-database-international-populations-by-si
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Description

    Midyear population estimates and projections for all countries and areas of the world with a population of 5,000 or more. Source: U.S. Census Bureau, Population Division, International Programs Center. Note: Total population is available from 1950 to 2100 for 227 countries and areas. Other demographic variables are available from the base year to 2100. The base year varies by country, and therefore data are not available for all years for all countries. See methodology: https://www.census.gov/programs-surveys/international-programs/about/idb.html

  8. Database & Directory Publishing in the US - Market Research Report (2015-2030)

    • ibisworld.com
    Updated Nov 15, 2024
    Cite
    IBISWorld (2024). Database & Directory Publishing in the US - Market Research Report (2015-2030) [Dataset]. https://www.ibisworld.com/united-states/market-research-reports/database-directory-publishing-industry/
    Dataset updated
    Nov 15, 2024
    Dataset authored and provided by
    IBISWorld
    License

    https://www.ibisworld.com/about/termsofuse/

    Time period covered
    2014 - 2029
    Description

    With the phone book era far in the past, database and directory publishers have been forced to transform their business approach, focusing on their digital presence. Despite many publishers rapidly moving away from print services, they are experiencing immovable competition from online search engines and social media platforms within the digital space, negatively affecting revenue growth potential. Industry revenue has been eroding at a CAGR of 4.4% over the past five years and in 2024, a 3.9% drop has led to the industry revenue totaling $4.4 billion. Profit continues to drop in line with revenue, accounting for 4.7% of revenue as publishers invest more in their digital platforms. Interest in printed directories has disappeared as institutional clients and consumers have continued their shift to convenient online resources. Declining demand for print advertising has curbed revenue growth and online revenue has only slightly mitigated this downturn. Though many traditional publishers, such as Yellow Pages, now operate under parent companies with digital resources, directory publishers remain low on the list of options businesses have to choose from in digital advertising. Due to the convenience and connectivity that Facebook and Google services offer, traditional directory publishers have a limited ability to compete. Many providers have rebranded and tailored their services toward client needs, though these efforts have only had a marginal impact on revenue growth. The industry is forecast to decline at an accelerated CAGR of 5.2% over the next five years, reaching an estimated $3.4 billion in 2029, as businesses and consumers continually turn to digital alternatives for information and advertising opportunities. As AI and digital technology innovation expands, social media company products will likely improve at a faster rate than the digital offerings that directory publishers can provide. Though these companies will seek external partnerships to cut costs, they face an uphill battle to boost their visibility and reverse consumer habit trends.

  9. Habitat Use Database - Groundfish Essential Fish Habitat (EFH) Habitat Use Database (HUD)

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 24, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Habitat Use Database - Groundfish Essential Fish Habitat (EFH) Habitat Use Database (HUD) [Dataset]. https://catalog.data.gov/dataset/habitat-use-database-groundfish-essential-fish-habitat-efh-habitat-use-database-hud3
    Dataset updated
    May 24, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The Habitat Use Database (HUD) was specifically designed to address the need for habitat-use analyses in support of the groundfish EFH, HAPCs, and fishing and nonfishing impacts components of the 2005 EFH EIS. HUD functionality and accessibility, and the ecological information upon which the HUD is based, will be improved in order for this database to fully support fisheries and ecosystem science and management. Upgrades to and applications of the HUD will be facilitated through a series of prioritized phases:

    • Fully integrate the data entry, quality control, and reporting capabilities from the original HUD Access database with a web-based and programmatic interface. Improve software for HUD to accommodate the most current habitat maps and habitat classification codes. This will be achieved by NMFS in consultation with HUD architects at Oregon State University.
    • Review and update the biological and ecological information in the HUD.
    • Develop and apply improved models that will be used to create updated habitat suitability maps for all west coast groundfish species using the updated HUD and Pacific coast seafloor habitat maps.
    • Integrate habitat suitability models with the online groundfish EFH data catalog (http://efh-catalog.coas.oregonstate.edu/overview/).

  10. 3D-Genomics Database

    • dknet.org
    • scicrunch.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London, and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.

  11. TetrapodTraits Database

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Oct 9, 2024
    Cite
    Moura, Mario R.; Ceron, Karoline; Guedes, Jhonny J. M.; Chen-Zhao, Rosana; Sica, Yanina; Hart, Julie; Dorman, Wendy; Portmann, Julia M.; Gonzalez-del-Pliego, Pamela; Ranipeta, Ajay; Catenazzi, Alessandro; Werneck, Fernanda; Toledo, Luis Felipe; Upham, Nathan; Tonini, Joao F. R.; Colston, Timothy J.; Guralnick, Robert; Bowie, Rauri C. K.; Pyron, R. Alexander; Jetz, Walter (2024). TetrapodTraits Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530617
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    University of Illinois Urbana-Champaign
    National Institute of Amazonian Research
    Florida International University
    University of California, Berkeley
    State University of New York
    Universidade Federal de Goiás
    Yale University
    University of Richmond
    George Washington University
    Universidade de Évora
    University of Puerto Rico-Mayaguez
    Universidade Federal do Ceará
    Arizona State University
    Universidade Estadual de Campinas (UNICAMP)
    University of Florida
    Authors
    Moura, Mario R.; Ceron, Karoline; Guedes, Jhonny J. M.; Chen-Zhao, Rosana; Sica, Yanina; Hart, Julie; Dorman, Wendy; Portmann, Julia M.; Gonzalez-del-Pliego, Pamela; Ranipeta, Ajay; Catenazzi, Alessandro; Werneck, Fernanda; Toledo, Luis Felipe; Upham, Nathan; Tonini, Joao F. R.; Colston, Timothy J.; Guralnick, Robert; Bowie, Rauri C. K.; Pyron, R. Alexander; Jetz, Walter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biased inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.

    Additional Information: This work is an output of the VertLife project. To flag errors, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org, and your feedback is essential to improve data quality and support community use.

    Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The error involves replacing white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an External Source here).

    Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biased data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in TetrapodTraits, while additional tables record the 10K values per missing entry per species.

    Taxonomy – includes 8 attributes that inform scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.

    Phylogenetic tree – includes 2 attributes that notify which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.

    Body size – includes 7 attributes that inform length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.

    Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, fossorial) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.

    Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.

    Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.

    Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.

    Threat status – includes 3 attributes that inform the assessed threat statuses according to IUCN red list and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.

    RangeSize – the number of 110×110 grid cells covered by the species range map. Data derived from MOL.

    Latitude – coordinate centroid of the species range map.

    Longitude – coordinate centroid of the species range map.

    Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.

    Insularity – includes 2 attributes that notify if a species is insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.

    AnnuMeanTemp – Average within-range annual mean temperature (Celsius degree). Data derived from CHELSA v. 1.2.

    AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.

    TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.

    PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.

    Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.

    ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).

    HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.

    PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.

    PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.

    PropPastureArea – Proportion of species range map covered by pasture, defined as Grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.

    PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
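
    To make the field groupings above concrete, here is a minimal pandas sketch (a hypothetical illustration: it assumes TetrapodTraits_1.0.0.csv sits in the working directory and uses only column names documented above):

    # Minimal sketch: load the main table and separate observed from imputed body masses.
    import pandas as pd

    df = pd.read_csv("TetrapodTraits_1.0.0.csv")

    # ImputedMass flags whether BodyMass_g is observed (0) or imputed (1).
    cols = ["Scientific.Name", "Class", "BodyMass_g", "ImputedMass", "Verticality"]
    subset = df[cols]
    observed = subset[subset["ImputedMass"] == 0]
    imputed = subset[subset["ImputedMass"] == 1]
    print(len(observed), "observed vs", len(imputed), "imputed body masses")

    # Example: median body mass per class, observed records only.
    print(observed.groupby("Class")["BodyMass_g"].median())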

    File content

    All files use UTF-8 encoding.

    ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1) data. Given that the complete set of 10,000 tables necessitates nearly 17GB of storage space, we have organized sets of 1,000 tables into separate zip files to streamline the download process.

    ImputedSets_1K.zip, imputations for trees 1 to 10.

    ImputedSets_2K.zip, imputations for trees 11 to 20.

    ImputedSets_3K.zip, imputations for trees 21 to 30.

    ImputedSets_4K.zip, imputations for trees 31 to 40.

    ImputedSets_5K.zip, imputations for trees 41 to 50.

    ImputedSets_6K.zip, imputations for trees 51 to 60.

    ImputedSets_7K.zip, imputations for trees 61 to 70.

    ImputedSets_8K.zip, imputations for trees 71 to 80.

    ImputedSets_9K.zip, imputations for trees 81 to 90.

    ImputedSets_10K.zip, imputations for trees 91 to 100.

    TetrapodTraits_1.0.0.csv – the complete TetrapodTraits database, with missing data entries in natural history traits (body length, body mass, activity time, and microhabitat) replaced by the average across the 10K imputed values obtained through phylogenetic multiple imputation. Please note that imputed microhabitat (attribute fields: Fos, Ter, Aqu, Arb, Aer) and imputed activity time (attribute fields: Diu, Noc) are continuous variables within the 0-1 range interval. At the user's

  12. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    • 1992 - 2010 download HRS microdata.R: loops through every year and every file, downloads, then unzips everything in one big party
    • import longitudinal RAND contributed files.R: creates a SQLite database (.db) on the local disk, then loads the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
    • longitudinal RAND - analysis examples.R: connects to the sql database created by the 'import longitudinal RAND contributed files' program, creates two database-backed complex sample survey objects using a taylor-series linearization design, then performs a mountain of analysis examples with wave weights from two different points in the panel
    • import example HRS file.R: loads a fixed-width file directly into ram using only the sas importation script with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html), parses through the IF block at the bottom of the sas importation script, blanks out a number of variables, then saves the file as an R data file (.rda) for fast loading later
    • replicate 2002 regression.R: connects to the sql database created by the 'import longitudinal RAND contributed files' program, creates a database-backed complex sample survey object using a taylor-series linearization design, and exactly matches the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs. notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  13. Harmonized Database of Western U.S. Water Rights (HarDWR)

    • osti.gov
    Updated Nov 15, 2023
    + more versions
    Cite
    Caccese, Robert; Fisher-Vanden, Karen; Fowler, Lara; Grogan, Danielle; Lammers, Richard; Lisk, Matthew; Olmstead, Sheila; Peklak, Darrah; Zheng, Jiameng; Zuidema, Shan (2023). Harmonized Database of Western U.S. Water Rights (HarDWR) [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2205619
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    MultiSector Dynamics - Living, Intuitive, Value-adding, Environment
    USDOE Office of Science (SC), Biological and Environmental Research (BER)
    Authors
    Caccese, Robert; Fisher-Vanden, Karen; Fowler, Lara; Grogan, Danielle; Lammers, Richard; Lisk, Matthew; Olmstead, Sheila; Peklak, Darrah; Zheng, Jiameng; Zuidema, Shan
    Area covered
    United States, Western United States
    Description

    From Lisk et al. (in review): "In the arid and semi-arid western U.S., access to water is regulated through a legal system of water rights. Individuals, companies, organizations, municipalities, and tribal entities have documents that declare their water rights. State water regulatory agencies collate and maintain these records, which can be used in legal disputes over access to water. While these records are publicly available data in all western U.S. states, the data have not yet been readily available in digital form from all states. Furthermore, there are many differences in data format, terminology, and definitions between state water regulatory agencies. Here, we have collected water rights data from 11 western U.S. state agencies, harmonized terminology and use definitions, formatted them consistently, and tied them to a western U.S.-wide shapefile of water administrative boundaries. We demonstrate how these data enable consistent regional-scale western U.S. hydrologic and economic modeling."

  14. Arthropod Kraken2 Database v1

    • demo.researchdata.se
    • researchdata.se
    Updated Aug 18, 2025
    + more versions
    Cite
    Samantha López Clinton; Tom van der Valk (2025). Arthropod Kraken2 Database v1 [Dataset]. http://doi.org/10.17044/SCILIFELAB.29666605
    Dataset updated
    Aug 18, 2025
    Dataset provided by
    Swedish Museum of Natural History
    Authors
    Samantha López Clinton; Tom van der Valk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kraken2 Arthropod Reference Database v1

    Kraken2 (v2.1.2) database containing all 2,593 reference assemblies for Arthropoda available on NCBI as of March 2023.

    This database was built for and used in the analysis of shotgun sequencing data of bulkDNA from Malaise trap samples collected by the Insect Biome Atlas, in the context of the manuscript "Small Bugs, Big Data: Metagenomics for arthropod biodiversity monitoring" by authors: López Clinton Samantha, Iwaszkiewicz-Eggebrecht Ela, Miraldo Andreia, Goodsell Robert, Webster Mathew T, Ronquist Fredrik, van der Valk Tom (for submission to Ecology and Evolution).

    For custom database building, Kraken2 requires all headers in reference assembly fasta files to be annotated with "kraken:taxid|XXX" at the end of each header, where "XXX" is the corresponding National Center for Biotechnology Information (NCBI) taxID of the species. The code used to add the taxID information to each fasta file header, and to update the accession2taxid.map file required by Kraken2 for database building, is available in this GitHub repository (https://github.com/SamanthaLop/Small_Bugs_Big_Data) (also linked under "Related Materials" below).
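
    The authors' actual implementation lives in the repository linked above; purely to illustrate the header convention, a minimal Python sketch (hypothetical file names and taxID) might tag each sequence ID, which is the token Kraken2 parses for the tag, like this:

    # Hypothetical sketch: tag every fasta header with kraken:taxid|XXX so that
    # Kraken2 can map each contig to its species during database building.
    # The real pipeline is in https://github.com/SamanthaLop/Small_Bugs_Big_Data.
    taxid = 7227  # placeholder NCBI taxID for this assembly's species

    with open("assembly.fna") as src, open("assembly.tagged.fna", "w") as dst:
        for line in src:
            if line.startswith(">"):
                # ">ID description" -> ">ID|kraken:taxid|7227 description"
                seq_id, _, desc = line[1:].rstrip("\n").partition(" ")
                header = f">{seq_id}|kraken:taxid|{taxid}"
                dst.write(header + (" " + desc if desc else "") + "\n")
            else:
                dst.write(line)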

    Content

    Below is a list of the files in this item (in addition to the README and MANIFEST files) and their descriptions. The first three files (marked with a *) are required to run Kraken2 classifications using the database.

    • * hash.k2d.gz - A hash file with all minimiser to taxon mappings (855 GB).
    • * opts.k2d - A file containing all options used when building the Kraken2 database (64 B).
    • * taxo.k2d - A file containing the taxonomy information used to build the database (385.9 KB).
    • seqid2taxid.map.gz - A file containing contig accession numbers and their corresponding taxids (810.6 MB). Note that this file is needed by Kraken2 when building the database, and as it was updated during custom building, it has been included for reference, but it is not required to use the database for classification.
    • genome_assembly_metadata.tsv - NCBI-generated table (tsv format, gzipped) of all reference assemblies for Arthropoda as of March 2023, which were used in the database construction. This includes columns: Assembly Accession, Assembly Name, Organism Name, Organism Infraspecific Names Breed, Organism Infraspecific Names Strain, Organism Infraspecific Names Cultivar, Organism Infraspecific Names Ecotype, Organism Infraspecific Names Isolate, Organism Infraspecific Names Sex, Annotation Name, Assembly Stats Total Sequence Length, Assembly Level, Assembly Submission, and WGS project accession.

    How to use the database

    • Download the hash.k2d.gz, opts.k2d, and taxo.k2d files to the same directory (e.g. /PATH/TO/DATABASE/).
    • Unzip the hash.k2d.gz file.
    • Install or load Kraken2 to run classification on sequencing data using the database.
    • When running Kraken2, indicate the path to the directory (not the individual files) with the --db flag (e.g. kraken2 --db /PATH/TO/DATABASE/ ...). Note that the whole database must be loaded into memory by Kraken2 to be able to classify any sequencing reads, so ensure you have access to enough memory before running (the uncompressed hash file is around 1.1 TB).

    We also recommend using the Kraken2 option --memory-mapping, as it ensures the database is loaded once for all samples, instead of once for each individual sample, saving considerable time and resources.

    For more information on using Kraken2, see the Kraken2 wiki manual (https://github.com/DerrickWood/kraken2/wiki/Manual) .
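
    As orientation only, here is a hedged sketch of driving a classification run from Python (paths, sample names, and thread count are placeholders; the flags are standard Kraken2 options):

    # Sketch: classify paired-end reads against the downloaded database.
    # /PATH/TO/DATABASE/ must contain hash.k2d, opts.k2d, and taxo.k2d.
    import subprocess

    subprocess.run([
        "kraken2",
        "--db", "/PATH/TO/DATABASE/",  # pass the directory, not the individual files
        "--memory-mapping",            # avoid reloading the ~1.1 TB hash for every sample
        "--threads", "16",
        "--paired", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "--gzip-compressed",
        "--report", "sample.kreport",
        "--output", "sample.kraken",
    ], check=True)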

    This database was built by Samantha López Clinton (samantha.lopezclinton@nrm) and Tom van der Valk (tom.vandervalk@nrm.se).

  15. Comprehensive Medical Q&A Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Comprehensive Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
    Available download formats: zip (5,126,941 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Medical Q&A Dataset

    Unlocking Healthcare Data with Natural Language Processing

    By Huggingface Hub [source]

    About this dataset

    The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!


    How to use the dataset

    In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.

    Once you have obtained new insights about healthcare based on the answers provided in this dynamic dataset, it's time for action! Use all that newfound understanding about patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, see if MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications if that happens.

    Finally once making an impact with the use case(s) - don't forget proper citation etiquette; give credit where credit is due!

    Research Ideas

    • Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
    • Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
    • Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------------------------------------------------|
    | qtype | The type of medical question. (String) |
    | Question | The medical question posed by the patient. (String) |
    | Answer | The expert response to the medical question. (String) |
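
    Given that schema, a minimal pandas sketch (assuming train.csv is in the working directory) that mirrors the SQL filter suggested earlier:

    # Sketch: the pandas equivalent of
    #   SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
    # (case-insensitive matching here, a slight broadening of LIKE)
    import pandas as pd

    df = pd.read_csv("train.csv")
    mask = (df["qtype"] == "Treatment") & df["Question"].str.contains("pain", case=False, na=False)
    answers = df.loc[mask, "Answer"]
    print(answers.head())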


  16. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
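
    Before moving on, you can sanity-check the connection string with a small Python sketch (an illustration, assuming sqlalchemy and a PostgreSQL driver such as psycopg2 are installed):

    # Sketch: verify that JUP_DB_CONNECTION points at the restored database.
    import os
    from sqlalchemy import create_engine, text

    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        # Any trivial query proves the connection works.
        print(conn.execute(text("SELECT 1")).scalar())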

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    GitHub account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection string
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
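    Before starting a long collection run, it may be worth sending a test message to confirm the credentials work (a sketch assuming the yagmail package is installed; the addresses mirror the JUP_EMAIL_* variables above):

    # send a test notification through the configured Gmail account
    python -c "import yagmail; yagmail.SMTP('gmail@gmail.com', oauth2_file='~/oauth2_creds.json').send('target@email.com', 'test', 'oauth2 setup works')"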

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but this is not advisable: the reproducibility study runs arbitrary code on your machine, and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create 5 Conda environments and 5 Anaconda environments, one of each for every Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  17. Transport Indicator Database Survey 2012 - Ghana

    • microdata.statsghana.gov.gh
    Updated Mar 14, 2016
    Cite
    Ghana Statistical Service (GSS) (2016). Transport Indicator Database Survey 2012 - Ghana [Dataset]. https://microdata.statsghana.gov.gh/index.php/catalog/82
    Explore at:
    Dataset updated
    Mar 14, 2016
    Dataset provided by
    Ghana Statistical Services
    Authors
    Ghana Statistical Service (GSS)
    Time period covered
    2012
    Area covered
    Ghana
    Description

    Abstract

    The efficient development, maintenance and administration of transport infrastructure and services are critical to the socio-economic development of any country. Scarce government resources and support from donor funds are required to provide these essential services to all sectors for the economic development of the country and for attaining equity and the participation of the populace in the creation of wealth and reduction of poverty.

    To ascertain the effectiveness of implemented policies and development programmes for transport-related infrastructure and services, key performance indicators are required. The data for developing these performance indicators must be collected on a sustainable basis by the various sectors for collation and analysis. Although most of the relevant basic data exist in many establishments, they are often scattered and are neither collated nor disseminated in any structured manner. The transportation sector is no exception. A recent study of the Ghana Road Sub-sector Programme finds an urgent need to reinforce the monitoring system of the MRT, as performance indicators have only partially been collected and used; the road condition mix is monitored on an annual basis while other basic performance indicators are lacking. A good monitoring system will help improve policy formulation within the sub-sector, while its absence may result in a major funding reduction, because the contribution to national development objectives, such as poverty alleviation, cannot be substantiated and demonstrated.

    Objectives of survey: The development objective of the TSPS-II, as defined in the Ghana Poverty Reduction Strategy (GPRS), is to sustain economic growth through the provision of safe, reliable, efficient and affordable services for all transport users. The focus of the transport sector under the GPRS is to provide access through better distribution of the transport network, with special emphasis on high-poverty areas, in order to reduce transport disparities between urban and rural communities. The household survey is a component of a bigger programme which will serve as a reliable and sustainable one-stop shop for all the data and performance indicators for the transport sector. The immediate objective of the sub-component is to improve the effectiveness of implementation of policies and development programmes for the transport sector, including related infrastructure and services. The direct aim of the sub-component will be the collection, processing, analysis, documentation and dissemination of transport-related data, which will be useful for:

    1. Transport planning and policy formulation;
    2. Impact assessment, monitoring and evaluation of policies and programmes;
    3. Measuring the contribution of the transport sector to the achievement of the MDGs;
    4. Impact assessment of the transport sector on poverty alleviation and the general standard of living;
    5. Comparisons of performance of the transport sector over time and between countries for the purpose of drawing lessons and giving an indication of where interventions are necessary;
    6. Provision of a comprehensive database for justification of programmes and projects under the Multi-Donor Budgetary Support (MDBS).

    Geographic coverage

    National level Region Level

    Analysis unit

    Household and Individual

    Universe

    The survey covered all household members (Usual residents)

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample was representative of all households in Ghana. To achieve the study objectives, the sample size chosen was based on the type of variables under consideration, the required precision of the survey estimates and available resources. Taking all of these into consideration, a sample size of 6,000 households was deemed sufficient to achieve the survey objectives. This was enough to yield reliable estimates of all the important survey variables as well as being manageable to control and minimize non-sampling errors.

    Stratification and Sample Selection Procedures The total list of the Enumeration Areas (EAs) from the demarcation for the 2010 Population and Housing Census formed the sampling frame for the Phase II of the Transport Indicators Survey. The sampling frame was stratified into urban/rural residence and the 10 administrative regions of the country for the selection of the sample. The sample was selected in two stages.

    The first stage selection involved the systematic selection of 400 EAs with probability proportional to size, the measure of size being the number of households in each EA. The second stage selection involved the systematic selection of 15 households from each EA. See Appendix A for more details on the sample design.
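    To make the first-stage mechanism concrete, the sketch below illustrates systematic selection with probability proportional to size (PPS). It is illustrative only: the EA identifiers and household counts are hypothetical, and this is not the GSS implementation.

    python3 - <<'EOF'
    # Illustrative sketch of systematic PPS selection: EAs are chosen with
    # probability proportional to their household counts.
    import random
    ea_sizes = {"EA001": 120, "EA002": 310, "EA003": 95, "EA004": 475}  # hypothetical frame
    n_select = 2                                   # number of EAs to draw
    total = sum(ea_sizes.values())
    step = total / n_select                        # sampling interval
    start = random.uniform(0, step)                # random start in [0, step)
    points = [start + i * step for i in range(n_select)]
    cum, chosen = 0, []
    for ea, size in ea_sizes.items():              # walk the cumulative size scale
        cum += size
        chosen += [ea for p in points if cum - size <= p < cum]
    print(chosen)                                  # larger EAs are more likely to be hit
    EOF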

    Sampling deviation

    No deviations

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The questionnaire had the following sections:

    Section A: a household roster which collected basic information on all households members and household characteristics to determine eligible household members

    Section B: an education section which was administered to household members aged 3 years and older on the use of transport services to school

    Section C: a health section that was used to collect information on all household members on access and the use of transport services to health facilities

    Section D: an economic activity section administered to household members 7 years and older to collect information on their economic activities and the use of transport services, together with a market access section administered to household members engaged in agricultural activities to collect information on access to transport services for the sale of farm produce

    Section E: a general transport services section administered to all household members on the access and use of various modes of transport.

    Section F: a general transport services section administered to all households on the access to and use of various modes of transport.

    Cleaning operations

    Control mechanisms were built into the data-capturing application: range checks and skip patterns were incorporated, and partial double entry was done in order to compare and correct errors. After data capture, secondary editing was done in the form of consistency checks. CSPro 4.1 was used to capture the data.

    Response rate

    National: (5996/6000)*100=99.93%

    By Regions: Western=99.8% Central= 100.0% Greater Accra= 100.0% Volta = 99.5% Eastern=100.0% Ashanti = 100.0% Brong Ahafo = 100.0% Northern = 100.0% Upper East = 100.0% Upper West = 100.0%

    Region          Households completed   Households expected   Response rate (%)
    Western         569                    570                   99.8
    Central         510                    510                   100.0
    Greater Accra   855                    855                   100.0
    Volta           567                    570                   99.5
    Eastern         705                    705                   100.0
    Ashanti         1,125                  1,125                 100.0
    Brong Ahafo     585                    585                   100.0
    Northern        615                    615                   100.0
    Upper East      285                    285                   100.0
    Upper West      180                    180                   100.0
    Total           5,996                  6,000                 99.9

    Causes of non-response, by region:

    Result of interview           Western   Volta   Total
    Refused                       1         0       1
    No household member at home   0         2       2
    Other                         0         1       1
    Total                         1         3       4

    Sampling error estimates

    Sampling errors were calculated but were not included in the report.

    Data appraisal

    No other forms of data appraisal were undertaken.

  18. The Nationwide Readmissions Database

    • datacatalog.library.wayne.edu
    Updated Jun 19, 2020
    Cite
    U.S. Agency for Healthcare Research and Quality (AHRQ) (2020). The Nationwide Readmissions Database [Dataset]. https://datacatalog.library.wayne.edu/dataset/the-nationwide-readmissions-database
    Explore at:
    Dataset updated
    Jun 19, 2020
    Dataset provided by
    U.S. Agency for Healthcare Research and Quality (AHRQ)
    Description

    The Nationwide Readmissions Database (NRD) is a unique and powerful database designed to support various types of analyses of national readmission rates for all patients, regardless of the expected payer for the hospital stay. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. Repeat stays may or may not be related; the criteria used to determine the relationship between hospital admissions are left to the analyst using the NRD. This database addresses a large gap in healthcare data: the lack of nationally representative information on hospital readmissions for all ages.

  19. Get access of 69 Million Professional's Email Database

    • datarade.ai
    .json, .csv
    Updated Sep 12, 2025
    Cite
    Bytescraper (2025). Get access of 69 Million Professional's Email Database [Dataset]. https://datarade.ai/data-products/get-access-of-69-million-professional-s-email-database-b2b-email-databases
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Sep 12, 2025
    Dataset authored and provided by
    Bytescraper
    Area covered
    Germany, Switzerland, Japan, Spain, Canada, India, New Zealand, Italy, South Africa, United Kingdom
    Description

    Good data is crucial for any business or organization that wants to grow its network, because all relevant details about companies and users are stored in the database. Many companies have benefited from using our email database to extract their prospects' details.

    It is well known that LinkedIn gives you the opportunity to expand your business network: you can connect with your prospects, directly or through mutual connections, by searching on keywords related to their name, company, profile, address, and so on. As a leading data provider, however, we spare you that work. Our professionals' email database contains the necessary business information about your prospects, with several ways to access them (especially email addresses and phone numbers).

    With our service, you can reach over 69 million records in 200+ countries. Our database is well organized and keeps information easily accessible. Increase your sales with reliable LinkedIn data that connects you directly to your goal; we have worked hard to supply a quality, reliable, sustainable email database.

  20. mimic-iii-clinical-database-demo-1.4

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    Montassar bellah (2025). mimic-iii-clinical-database-demo-1.4 [Dataset]. https://www.kaggle.com/datasets/montassarba/mimic-iii-clinical-database-demo-1-4
    Explore at:
    zip (11,100,065 bytes; available download format)
    Dataset updated
    Apr 1, 2025
    Authors
    Montassar bellah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Abstract MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.

    Background In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.

    MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.

    The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.

    Methods The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.

    This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.

    Data Description MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.

    The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
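    As a minimal sketch of this convention, Python's standard csv module implements RFC 4180 quoting, so the example string above round-trips as expected:

    python3 - <<'EOF'
    # RFC 4180: embedded double quotes are doubled inside a quoted field
    import csv, io
    raw = '"she said ""the patient was notified at 6pm"""'
    print(next(csv.reader(io.StringIO(raw))))
    # -> ['she said "the patient was notified at 6pm"']
    EOF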

    Usage Notes The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.

    CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files; SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.
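    For example, the sqlite3 command-line shell can import the CSV files directly (a sketch assuming the sqlite3 shell is installed; the database filename is arbitrary, and ADMISSIONS.csv stands in for any of the demo tables):

    # import one demo CSV file into a local SQLite database; repeat per table
    # if the target table does not exist, .import creates it and uses the
    # CSV header row for the column names
    sqlite3 mimic3_demo.db ".mode csv" ".import ADMISSIONS.csv ADMISSIONS"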

    DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/

    Release Notes Release notes for the demo follow the release notes for the MIMIC-III database.

    Acknowledgements This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.

    Conflicts of Interest The authors declare no competing financial interests.

    References Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
