100+ datasets found
  1. Functional Use Database (FUse)

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Functional Use Database (FUse) [Dataset]. https://catalog.data.gov/dataset/functional-use-database-fuse
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    There are five different files for this dataset:

    1. A dataset listing the reported functional uses of chemicals (FUse)
    2. All 729 ToxPrint descriptors obtained from ChemoTyper for chemicals in FUse
    3. All EPI Suite properties obtained for chemicals in FUse
    4. The confusion matrix values, similarity thresholds, and bioactivity index for each model
    5. The functional use prediction, bioactivity index, and prediction classification (poor prediction, functional substitute, candidate alternative) for each Tox21 chemical

    This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. GREEN CHEMISTRY. Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074, (2017).

  2. Data from: Agri-Food System Water Use Database

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jun 4, 2024
    Cite
    International Food Policy Research Institute (IFPRI) (2024). Agri-Food System Water Use Database [Dataset]. http://doi.org/10.7910/DVN/FZK8WE
    Available download formats: Croissant
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    International Food Policy Research Institute (IFPRI)
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FZK8WE

    Time period covered
    2022
    Description

    This database provides information about the amount of water used in agri-food systems, covering all sectors from farming to food processing industries. The data are presented at the country level with sectoral disaggregation following the Nexus Social Accounting Matrix (SAM) sectoral specifications. The database also differentiates the type of water in each sector based on water sources. Green water refers to water originating from precipitation or rain, while blue water refers to all water that comes from irrigation, covering both surface water and groundwater. Both types of water are consumed by plants or animals during the production process. Grey water, on the other hand, is the amount of water generated by production activities that pollute water. Since it carries loads of pollutants created by production activities, this type of water can be seen as waste in the whole production system.

  3. O*NET Database

    • onetcenter.org
    excel, mysql, oracle +2
    Cite
    National Center for O*NET Development, O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Available download formats: oracle, sql server, text, mysql, excel
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    US Department of Labor, Employment and Training Administration
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)

  4. Dofus Database

    • kaggle.com
    zip
    Updated Aug 6, 2022
    Cite
    PostMortem (2022). Dofus Database [Dataset]. https://www.kaggle.com/datasets/pstmrtem/dofus-dabase
    Available download formats: zip (1,552,056 bytes)
    Dataset updated
    Aug 6, 2022
    Authors
    PostMortem
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dofus Database

    How it has been gathered

    This dataset was produced by scraping the encyclopedia on Dofus' website.

    It was first done as a challenge and as a way to gather some textual data about the game. You can find the code used to scrape and parse the data here: https://github.com/Futurne/dofus_scrap.

    Quick presentation

    The main files are the JSON files, where you will find all scraped items of the game from the encyclopedia. Those files are named after their categories in the encyclopedia (e.g. you will find all weapons in the armes.json file). You can explore those files; they are pretty self-explanatory.

    Another dataset here is almanax.csv, which is simply a dataset of every almanax description, scraped over a whole year. For each day, you will find the boss, rubrikabrax, and meryde descriptions.

    How to use it

    You can use this dataset to fine-tune a pretrained French model on all the textual information (in almanax.csv and in the JSON files). The items have a description property that can be gathered into a big NLP dataset.

    You can also do some data analysis: can you find which items are overpowered? Are they harder to craft?

    Finally, one other idea would be to build an automatic "stuff" (gear set) optimizer: you could ask for the best Dofus gear set, one that maximizes one element while satisfying some constraints.

    Note that I eventually found that there is already an unofficial API (just like my dataset is unofficial) allowing one to get all the data. You can check out their project here: https://dofapi.fr/. They do not seem to have updated it in a long time, though.

  5. DataSheet2_Data Sources for Drug Utilization Research in Brazil—DUR-BRA Study.xlsx

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    + more versions
    Cite
    Lisiane Freitas Leal; Claudia Garcia Serpa Osorio-de-Castro; Luiz Júpiter Carneiro de Souza; Felipe Ferre; Daniel Marques Mota; Marcia Ito; Monique Elseviers; Elisangela da Costa Lima; Ivan Ricardo Zimmernan; Izabela Fulone; Monica Da Luz Carvalho-Soares; Luciane Cruz Lopes (2023). DataSheet2_Data Sources for Drug Utilization Research in Brazil—DUR-BRA Study.xlsx [Dataset]. http://doi.org/10.3389/fphar.2021.789872.s002
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Lisiane Freitas Leal; Claudia Garcia Serpa Osorio-de-Castro; Luiz Júpiter Carneiro de Souza; Felipe Ferre; Daniel Marques Mota; Marcia Ito; Monique Elseviers; Elisangela da Costa Lima; Ivan Ricardo Zimmernan; Izabela Fulone; Monica Da Luz Carvalho-Soares; Luciane Cruz Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brazil
    Description

    Background: In Brazil, studies that map electronic healthcare databases in order to assess their suitability for use in pharmacoepidemiologic research are lacking. We aimed to identify, catalogue, and characterize Brazilian data sources for Drug Utilization Research (DUR).

    Methods: The present study is part of the project entitled, “Publicly Available Data Sources for Drug Utilization Research in Latin American (LatAm) Countries.” A network of Brazilian health experts was assembled to map secondary administrative data from healthcare organizations that might provide information related to medication use. A multi-phase approach including internet search of institutional government websites, traditional bibliographic databases, and experts’ input was used for mapping the data sources. The reviewers searched, screened and selected the data sources independently; disagreements were resolved by consensus. Data sources were grouped into the following categories: 1) automated databases; 2) Electronic Medical Records (EMR); 3) national surveys or datasets; 4) adverse event reporting systems; and 5) others. Each data source was characterized by accessibility, geographic granularity, setting, type of data (aggregate or individual-level), and years of coverage. We also searched for publications related to each data source.

    Results: A total of 62 data sources were identified and screened; 38 met the eligibility criteria for inclusion and were fully characterized. We grouped 23 (60%) as automated databases, four (11%) as adverse event reporting systems, four (11%) as EMRs, three (8%) as national surveys or datasets, and four (11%) as other types. Eighteen (47%) were classified as publicly and conveniently accessible online, providing information at the national level. Most of them offered more than 5 years of comprehensive data coverage and presented data at both the individual and aggregated levels. No information about population coverage was found. Drug coding is not uniform; each data source has its own coding system, depending on the purpose of the data. At least one scientific publication was found for each publicly available data source.

    Conclusions: There are several types of data sources for DUR in Brazil, but a uniform system for drug classification and data quality evaluation does not exist. The extent of population covered by year is unknown. Our comprehensive and structured inventory reveals a need for full characterization of these data sources.

  6. HCUP California

    • stanford.redivis.com
    • redivis.com
    application/jsonl +7
    Updated May 20, 2020
    Cite
    Stanford Center for Population Health Sciences (2020). HCUP California [Dataset]. http://doi.org/10.57761/krfh-m184
    Available download formats: application/jsonl, arrow, parquet, sas, avro, spss, csv, stata
    Dataset updated
    May 20, 2020
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 2008 - Dec 31, 2011
    Area covered
    California
    Description

    Abstract

    The State Ambulatory Surgery Databases (SASD), State Inpatient Databases (SID), and State Emergency Department Databases (SEDD) are part of a family of databases and software tools developed for the Healthcare Cost and Utilization Project (HCUP).

    HCUP's state-specific databases can be used to investigate state-specific and multi-state trends in health care utilization, access, charges, quality, and outcomes. PHS has several years (2008-2011) and datasets (SASD, SEDD, and SID) for HCUP California available.

    Usage

    The State Ambulatory Surgery and Services Databases (SASD) are State-specific files that include data for ambulatory surgery and other outpatient services from hospital-owned facilities. In addition, some States provide data on ambulatory surgery and outpatient services from nonhospital-owned facilities. The uniform format of the SASD helps facilitate cross-State comparisons. The SASD are well suited for research that requires complete enumeration of hospital-based ambulatory surgeries within geographic areas or States.

    The State Inpatient Databases (SID) are State-specific files that contain all inpatient care records in participating states. Together, the SID encompass more than 95 percent of all U.S. hospital discharges. The uniform format of the SID helps facilitate cross-state comparisons. In addition, the SID are well suited for research that requires complete enumeration of hospitals and discharges within geographic areas or states.

    The State Emergency Department Databases (SEDD) are a set of longitudinal State-specific emergency department (ED) databases included in the HCUP family. The SEDD capture discharge information on all emergency department visits that do not result in an admission. Information on patients seen in the emergency room and then admitted to the hospital is included in the State Inpatient Databases (SID).

    SASD, SID, and SEDD each have documentation which includes:

    • Description of the Database
    • Restrictions on Use
    • File Specifications and Load Program
    • Data Elements
    • Additional Resources for Data Elements
    • ICD-10-CM/PCS Data Included in the Dataset Starting with 2015
    • Known Data Issues
    • HCUP Tools: Labels and Formats
    • HCUP Supplemental Files
    • Obtaining HCUP Data


    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission.

    We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit:

    https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    Documentation

    The HCUP California inpatient files were constructed from the confidential files received from the Office of Statewide Health Planning and Development (OSHPD). OSHPD excluded inpatient stays that, after processing by OSHPD, did not contain a complete and “in-range” admission date or discharge date. California also excluded inpatient stays that had an unknown or missing date of birth. OSHPD removes ICD-9-CM and ICD-10-CM diagnosis codes for HIV test results. Beginning with 2009 data, OSHPD changed regulations to require hospitals to report all external cause of injury diagnosis codes, including those specific to medical misadventures. Prior to 2009, OSHPD did not require collection of diagnosis codes identifying medical misadventures.

    Types of Facilities Included in the Files Provided to HCUP by the Partner

    California supplied discharge data for inpatient stays in general acute care hospitals, acute psychiatric hospitals, chemical dependency recovery hospitals, psychiatric health facilities, and state operated hospitals. A comparison of the number of hospitals included in the SID and the number of hospitals reported in the AHA Annual Survey is available starting in data year 2010. Hospitals do not always report data for a full calendar year. Some hospitals open or close during the year; other hospitals have technical problems that prevent them from reporting data for all months in a year.

    Inclusion of Stays in Special Units

    Included with the general acute care stays are stays in skilled nursing, intermediate care, rehabilitation, alcohol/chemical dependency treatment, and psychiatric units of hospitals in California. How the stays in these different types of units can be identified differs by data year. Beginning in 2006, the information is retained in the HCUP variable HOSPITALUNIT. Reliability of this indicator for the level of care depends on how it was assigned by the hospital. For data years 1998-2006, the information was retained in the HCUP variable LEVELCARE. Prior to 1998, the first

  7. Time Series International Database: International Populations by Single Year of Age and Sex

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 30, 2025
    + more versions
    Cite
    U.S. Census Bureau (2025). Time Series International Database: International Populations by Single Year of Age and Sex [Dataset]. https://catalog.data.gov/dataset/international-data-base-time-series-international-database-international-populations-by-si
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Description

    Midyear population estimates and projections for all countries and areas of the world with a population of 5,000 or more. Source: U.S. Census Bureau, Population Division, International Programs Center. Note: Total population is available from 1950 to 2100 for 227 countries and areas. Other demographic variables are available from the base year to 2100. The base year varies by country, and therefore data are not available for all years for all countries. See methodology: https://www.census.gov/programs-surveys/international-programs/about/idb.html

  8. Database & Directory Publishing in the US - Market Research Report (2015-2030)

    • ibisworld.com
    Updated Nov 15, 2024
    Cite
    IBISWorld (2024). Database & Directory Publishing in the US - Market Research Report (2015-2030) [Dataset]. https://www.ibisworld.com/united-states/market-research-reports/database-directory-publishing-industry/
    Dataset updated
    Nov 15, 2024
    Dataset authored and provided by
    IBISWorld
    License

    https://www.ibisworld.com/about/termsofuse/

    Time period covered
    2014 - 2029
    Description

    With the phone book era far in the past, database and directory publishers have been forced to transform their business approach, focusing on their digital presence. Despite many publishers rapidly moving away from print services, they are experiencing immovable competition from online search engines and social media platforms within the digital space, negatively affecting revenue growth potential. Industry revenue has been eroding at a CAGR of 4.4% over the past five years and in 2024, a 3.9% drop has led to the industry revenue totaling $4.4 billion. Profit continues to drop in line with revenue, accounting for 4.7% of revenue as publishers invest more in their digital platforms. Interest in printed directories has disappeared as institutional clients and consumers have continued their shift to convenient online resources. Declining demand for print advertising has curbed revenue growth and online revenue has only slightly mitigated this downturn. Though many traditional publishers, such as Yellow Pages, now operate under parent companies with digital resources, directory publishers remain low on the list of options businesses have to choose from in digital advertising. Due to the convenience and connectivity that Facebook and Google services offer, traditional directory publishers have a limited ability to compete. Many providers have rebranded and tailored their services toward client needs, though these efforts have only had a marginal impact on revenue growth. The industry is forecast to decline at an accelerated CAGR of 5.2% over the next five years, reaching an estimated $3.4 billion in 2029, as businesses and consumers continually turn to digital alternatives for information and advertising opportunities. As AI and digital technology innovation expands, social media company products will likely improve at a faster rate than the digital offerings that directory publishers can provide. Though these companies will seek external partnerships to cut costs, they face an uphill battle to boost their visibility and reverse consumer habit trends.

  9. Habitat Use Database - Groundfish Essential Fish Habitat (EFH) Habitat Use Database (HUD)

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 24, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Habitat Use Database - Groundfish Essential Fish Habitat (EFH) Habitat Use Database (HUD) [Dataset]. https://catalog.data.gov/dataset/habitat-use-database-groundfish-essential-fish-habitat-efh-habitat-use-database-hud3
    Dataset updated
    May 24, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    The Habitat Use Database (HUD) was specifically designed to address the need for habitat-use analyses in support of the groundfish EFH, HAPCs, and fishing and nonfishing impacts components of the 2005 EFH EIS. HUD functionality and accessibility, and the ecological information upon which the HUD is based, will be improved in order for this database to fully support fisheries and ecosystem science and management. Upgrades to and applications of the HUD will be facilitated through a series of prioritized phases:

    • Fully integrate the data entry, quality control, and reporting capabilities from the original HUD Access database with a web-based and programmatic interface. Improve software for HUD to accommodate the most current habitat maps and habitat classification codes. This will be achieved by NMFS in consultation with HUD architects at Oregon State University.
    • Review and update the biological and ecological information in the HUD.
    • Develop and apply improved models that will be used to create updated habitat suitability maps for all west coast groundfish species using the updated HUD and Pacific coast seafloor habitat maps.
    • Integrate habitat suitability models with the online groundfish EFH data catalog (http://efh-catalog.coas.oregonstate.edu/overview/).

  10. 3D-Genomics Database

    • dknet.org
    • scicrunch.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). 3D-Genomics Database [Dataset]. http://identifiers.org/RRID:SCR_007430
    Dataset updated
    Jan 29, 2022
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London, and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C.briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E.cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G.theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N.crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P.falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the yeast genome.

  11. TetrapodTraits Database

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Oct 9, 2024
    Cite
    Moura, Mario R.; Ceron, Karoline; Guedes, Jhonny J. M.; Chen-Zhao, Rosana; Sica, Yanina; Hart, Julie; Dorman, Wendy; Portmann, Julia M.; Gonzalez-del-Pliego, Pamela; Ranipeta, Ajay; Catenazzi, Alessandro; Werneck, Fernanda; Toledo, Luis Felipe; Upham, Nathan; Tonini, Joao F. R.; Colston, Timothy J.; Guralnick, Robert; Bowie, Rauri C. K.; Pyron, R. Alexander; Jetz, Walter (2024). TetrapodTraits Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530617
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    University of Illinois Urbana-Champaign
    National Institute of Amazonian Research
    Florida International University
    University of California, Berkeley
    State University of New York
    Universidade Federal de Goiás
    Yale University
    University of Richmond
    George Washington University
    Universidade de Évora
    University of Puerto Rico-Mayaguez
    Universidade Federal do Ceará
    Arizona State University
    Universidade Estadual de Campinas (UNICAMP)
    University of Florida
    Authors
    Moura, Mario R.; Ceron, Karoline; Guedes, Jhonny J. M.; Chen-Zhao, Rosana; Sica, Yanina; Hart, Julie; Dorman, Wendy; Portmann, Julia M.; Gonzalez-del-Pliego, Pamela; Ranipeta, Ajay; Catenazzi, Alessandro; Werneck, Fernanda; Toledo, Luis Felipe; Upham, Nathan; Tonini, Joao F. R.; Colston, Timothy J.; Guralnick, Robert; Bowie, Rauri C. K.; Pyron, R. Alexander; Jetz, Walter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biased inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.

    Additional Information: This work is an output of the VertLife project. To flag errors, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org, and your feedback is essential to improve data quality and support community use.

    Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The error involves replacing white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an External Source here).

    Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biased data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in TetrapodTraits, while additional tables record the 10K values per missing entry per species.

    Taxonomy – includes 8 attributes that inform scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.

    Phylogenetic tree – includes 2 attributes that notify which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.

    Body size – includes 7 attributes that inform length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.

    Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, fossorial) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.

    Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.

    Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.

    Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.

    Threat status – includes 3 attributes that inform the assessed threat statuses according to IUCN red list and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.

    RangeSize – the number of 110×110 grid cells covered by the species range map. Data derived from MOL.

    Latitude – coordinate centroid of the species range map.

    Longitude – coordinate centroid of the species range map.

    Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.

    Insularity – includes 2 attributes that notify if a species is insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.

    AnnuMeanTemp – Average within-range annual mean temperature (Celsius degree). Data derived from CHELSA v. 1.2.

    AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.

    TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.

    PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.

    Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.

    ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).

    HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.

    PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.

    PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.

    PropPastureArea – Proportion of species range map covered by pasture, defined as Grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.

    PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
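
    To make the field groupings above concrete, here is a minimal pandas sketch (a hypothetical illustration: it assumes TetrapodTraits_1.0.0.csv sits in the working directory and uses only column names documented above):

    # Minimal sketch: load the main table and separate observed from imputed body masses.
    import pandas as pd

    df = pd.read_csv("TetrapodTraits_1.0.0.csv")

    # ImputedMass flags whether BodyMass_g is observed (0) or imputed (1).
    cols = ["Scientific.Name", "Class", "BodyMass_g", "ImputedMass", "Verticality"]
    subset = df[cols]
    observed = subset[subset["ImputedMass"] == 0]
    imputed = subset[subset["ImputedMass"] == 1]
    print(len(observed), "observed vs", len(imputed), "imputed body masses")

    # Example: median body mass per class, observed records only.
    print(observed.groupby("Class")["BodyMass_g"].median())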

    File content

    All files use UTF-8 encoding.

    ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1) data. Given that the complete set of 10,000 tables necessitates nearly 17GB of storage space, we have organized sets of 1,000 tables into separate zip files to streamline the download process.

    ImputedSets_1K.zip, imputations for trees 1 to 10.

    ImputedSets_2K.zip, imputations for trees 11 to 20.

    ImputedSets_3K.zip, imputations for trees 21 to 30.

    ImputedSets_4K.zip, imputations for trees 31 to 40.

    ImputedSets_5K.zip, imputations for trees 41 to 50.

    ImputedSets_6K.zip, imputations for trees 51 to 60.

    ImputedSets_7K.zip, imputations for trees 61 to 70.

    ImputedSets_8K.zip, imputations for trees 71 to 80.

    ImputedSets_9K.zip, imputations for trees 81 to 90.

    ImputedSets_10K.zip, imputations for trees 91 to 100.

    TetrapodTraits_1.0.0.csv – the complete TetrapodTraits database, with missing data entries in natural history traits (body length, body mass, activity time, and microhabitat) replaced by the average across the 10K imputed values obtained through phylogenetic multiple imputation. Please note that imputed microhabitat (attribute fields: Fos, Ter, Aqu, Arb, Aer) and imputed activity time (attribute fields: Diu, Noc) are continuous variables within the 0-1 range interval. At the user's

  12. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    • 1992 - 2010 download HRS microdata.R: loops through every year and every file, downloads, then unzips everything in one big party
    • import longitudinal RAND contributed files.R: creates a SQLite database (.db) on the local disk, then loads the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
    • longitudinal RAND - analysis examples.R: connects to the sql database created by the 'import longitudinal RAND contributed files' program, creates two database-backed complex sample survey objects using a taylor-series linearization design, then performs a mountain of analysis examples with wave weights from two different points in the panel
    • import example HRS file.R: loads a fixed-width file directly into ram using only the sas importation script with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html), parses through the IF block at the bottom of the sas importation script, blanks out a number of variables, then saves the file as an R data file (.rda) for fast loading later
    • replicate 2002 regression.R: connects to the sql database created by the 'import longitudinal RAND contributed files' program, creates a database-backed complex sample survey object using a taylor-series linearization design, and exactly matches the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs. notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  13. Harmonized Database of Western U.S. Water Rights (HarDWR)

    • osti.gov
    Updated Nov 15, 2023
    + more versions
    Cite
    Caccese, Robert; Fisher-Vanden, Karen; Fowler, Lara; Grogan, Danielle; Lammers, Richard; Lisk, Matthew; Olmstead, Sheila; Peklak, Darrah; Zheng, Jiameng; Zuidema, Shan (2023). Harmonized Database of Western U.S. Water Rights (HarDWR) [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2205619
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    MultiSector Dynamics - Living, Intuitive, Value-adding, Environment
    USDOE Office of Science (SC), Biological and Environmental Research (BER)
    Authors
    Caccese, Robert; Fisher-Vanden, Karen; Fowler, Lara; Grogan, Danielle; Lammers, Richard; Lisk, Matthew; Olmstead, Sheila; Peklak, Darrah; Zheng, Jiameng; Zuidema, Shan
    Area covered
    United States, Western United States
    Description

    From Lisk et al. (in review): "In the arid and semi-arid western U.S., access to water is regulated through a legal system of water rights. Individuals, companies, organizations, municipalities, and tribal entities have documents that declare their water rights. State water regulatory agencies collate and maintain these records, which can be used in legal disputes over access to water. While these records are publicly available data in all western U.S. states, the data have not yet been readily available in digital form from all states. Furthermore, there are many differences in data format, terminology, and definitions between state water regulatory agencies. Here, we have collected water rights data from 11 western U.S. state agencies, harmonized terminology and use definitions, formatted them consistently, and tied them to a western U.S.-wide shapefile of water administrative boundaries. We demonstrate how these data enable consistent regional-scale western U.S. hydrologic and economic modeling."

  14. Arthropod Kraken2 Database v1

    • demo.researchdata.se
    • researchdata.se
    Updated Aug 18, 2025
    + more versions
    Cite
    Samantha López Clinton; Tom van der Valk (2025). Arthropod Kraken2 Database v1 [Dataset]. http://doi.org/10.17044/SCILIFELAB.29666605
    Dataset updated
    Aug 18, 2025
    Dataset provided by
    Swedish Museum of Natural History
    Authors
    Samantha López Clinton; Tom van der Valk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kraken2 Arthropod Reference Database v1

    Kraken2 (v2.1.2) database containing all 2,593 reference assemblies for Arthropoda available on NCBI as of March 2023.

    This database was built for and used in the analysis of shotgun sequencing data of bulkDNA from Malaise trap samples collected by the Insect Biome Atlas, in the context of the manuscript "Small Bugs, Big Data: Metagenomics for arthropod biodiversity monitoring" by authors: López Clinton Samantha, Iwaszkiewicz-Eggebrecht Ela, Miraldo Andreia, Goodsell Robert, Webster Mathew T, Ronquist Fredrik, van der Valk Tom (for submission to Ecology and Evolution).

    For custom database building, Kraken2 requires all headers in reference assembly fasta files to be annotated with "kraken:taxid|XXX" at the end of each header, where "XXX" is the corresponding National Center for Biotechnology Information (NCBI) taxID of the species. The code used to add the taxID information to each fasta file header, and to update the accession2taxid.map file required by Kraken2 for database building, is available in this GitHub repository (https://github.com/SamanthaLop/Small_Bugs_Big_Data) (also linked under "Related Materials" below).
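
    The authors' actual implementation lives in the repository linked above; purely to illustrate the header convention, a minimal Python sketch (hypothetical file names and taxID) might tag each sequence ID, which is the token Kraken2 parses for the tag, like this:

    # Hypothetical sketch: tag every fasta header with kraken:taxid|XXX so that
    # Kraken2 can map each contig to its species during database building.
    # The real pipeline is in https://github.com/SamanthaLop/Small_Bugs_Big_Data.
    taxid = 7227  # placeholder NCBI taxID for this assembly's species

    with open("assembly.fna") as src, open("assembly.tagged.fna", "w") as dst:
        for line in src:
            if line.startswith(">"):
                # ">ID description" -> ">ID|kraken:taxid|7227 description"
                seq_id, _, desc = line[1:].rstrip("\n").partition(" ")
                header = f">{seq_id}|kraken:taxid|{taxid}"
                dst.write(header + (" " + desc if desc else "") + "\n")
            else:
                dst.write(line)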

    Content

    Below is a list of the files in this item (in addition to the README and MANIFEST files) and their descriptions. The first three files (marked with a *) are required to run Kraken2 classifications using the database.

    • * hash.k2d.gz - A hash file with all minimiser to taxon mappings (855 GB).
    • * opts.k2d - A file containing all options used when building the Kraken2 database (64 B).
    • * taxo.k2d - A file containing the taxonomy information used to build the database (385.9 KB).
    • seqid2taxid.map.gz - A file containing contig accession numbers and their corresponding taxids (810.6 MB). Note that this file is needed by Kraken2 when building the database, and as it was updated during custom building, it has been included for reference, but it is not required to use the database for classification.
    • genome_assembly_metadata.tsv - NCBI-generated table (tsv format, gzipped) of all reference assemblies for Arthropoda as of March 2023, which were used in the database construction. This includes columns: Assembly Accession, Assembly Name, Organism Name, Organism Infraspecific Names Breed, Organism Infraspecific Names Strain, Organism Infraspecific Names Cultivar, Organism Infraspecific Names Ecotype, Organism Infraspecific Names Isolate, Organism Infraspecific Names Sex, Annotation Name, Assembly Stats Total Sequence Length, Assembly Level, Assembly Submission, and WGS project accession.

    How to use the database

    • Download the hash.k2d.gz, opts.k2d, and taxo.k2d files to the same directory (e.g. /PATH/TO/DATABASE/).
    • Unzip the hash.k2d.gz file.
    • Install or load Kraken2 to run classification on sequencing data using the database.
    • When running Kraken2, indicate the path to the directory (not the individual files) with the --db flag (e.g. kraken2 --db /PATH/TO/DATABASE/ ...). Note that the whole database must be loaded into memory by Kraken2 to be able to classify any sequencing reads, so ensure you have access to enough memory before running (the uncompressed hash file is around 1.1 TB).

    We also recommend using the Kraken2 option --memory-mapping, as it ensures the database is loaded once for all samples, instead of once for each individual sample, saving considerable time and resources.

    For more information on using Kraken2, see the Kraken2 wiki manual (https://github.com/DerrickWood/kraken2/wiki/Manual) .
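
    As orientation only, here is a hedged sketch of driving a classification run from Python (paths, sample names, and thread count are placeholders; the flags are standard Kraken2 options):

    # Sketch: classify paired-end reads against the downloaded database.
    # /PATH/TO/DATABASE/ must contain hash.k2d, opts.k2d, and taxo.k2d.
    import subprocess

    subprocess.run([
        "kraken2",
        "--db", "/PATH/TO/DATABASE/",  # pass the directory, not the individual files
        "--memory-mapping",            # avoid reloading the ~1.1 TB hash for every sample
        "--threads", "16",
        "--paired", "sample_R1.fastq.gz", "sample_R2.fastq.gz",
        "--gzip-compressed",
        "--report", "sample.kreport",
        "--output", "sample.kraken",
    ], check=True)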

    This database was built by Samantha López Clinton (samantha.lopezclinton@nrm) and Tom van der Valk (tom.vandervalk@nrm.se).

  15. Comprehensive Medical Q&A Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Comprehensive Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
    Available download formats: zip (5,126,941 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Medical Q&A Dataset

    Unlocking Healthcare Data with Natural Language Processing

    By Huggingface Hub [source]

    About this dataset

    The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!


    How to use the dataset

    In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.

    Once you have obtained new insights about healthcare based on the answers provided in this dynamic dataset, it's time for action! Use all that newfound understanding about patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, see if MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications if that happens.

    Finally once making an impact with the use case(s) - don't forget proper citation etiquette; give credit where credit is due!

    Research Ideas

    • Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
    • Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
    • Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------------------------------------------------|
    | qtype | The type of medical question. (String) |
    | Question | The medical question posed by the patient. (String) |
    | Answer | The expert response to the medical question. (String) |
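
    Given that schema, a minimal pandas sketch (assuming train.csv is in the working directory) that mirrors the SQL filter suggested earlier:

    # Sketch: the pandas equivalent of
    #   SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
    # (case-insensitive matching here, a slight broadening of LIKE)
    import pandas as pd

    df = pd.read_csv("train.csv")
    mask = (df["qtype"] == "Treatment") & df["Question"].str.contains("pain", case=False, na=False)
    answers = df.loc[mask, "Answer"]
    print(answers.head())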


  16. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
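
    Before moving on, you can sanity-check the connection string with a small Python sketch (an illustration, assuming sqlalchemy and a PostgreSQL driver such as psycopg2 are installed):

    # Sketch: verify that JUP_DB_CONNECTION points at the restored database.
    import os
    from sqlalchemy import create_engine, text

    engine = create_engine(os.environ["JUP_DB_CONNECTION"])
    with engine.connect() as conn:
        # Any trivial query proves the connection works.
        print(conn.execute(text("SELECT 1")).scalar())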

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    GitHub account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection string
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
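    Before starting a long collection run, it may be worth sending a test message to confirm the credentials work (a sketch assuming the yagmail package is installed; the addresses mirror the JUP_EMAIL_* variables above):

    # send a test notification through the configured Gmail account
    python -c "import yagmail; yagmail.SMTP('gmail@gmail.com', oauth2_file='~/oauth2_creds.json').send('target@email.com', 'test', 'oauth2 setup works')"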

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but this is not advisable: the reproducibility study runs arbitrary code on your machine, and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create 5 Conda environments and 5 Anaconda environments, one of each for every Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  17. Transport Indicator Database Survey 2012 - Ghana

    • microdata.statsghana.gov.gh
    Updated Mar 14, 2016
    Cite
    Ghana Statistical Service (GSS) (2016). Transport Indicator Database Survey 2012 - Ghana [Dataset]. https://microdata.statsghana.gov.gh/index.php/catalog/82
    Explore at:
    Dataset updated
    Mar 14, 2016
    Dataset provided by
    Ghana Statistical Services
    Authors
    Ghana Statistical Service (GSS)
    Time period covered
    2012
    Area covered
    Ghana
    Description

    Abstract

    The efficient development, maintenance and administration of transport infrastructure and services are critical to the socio-economic development of any country. Scarce government resources and support from donor funds are required to provide these essential services to all sectors for the economic development of the country and for attaining equity and the participation of the populace in the creation of wealth and reduction of poverty.

    To ascertain the effectiveness of implemented policies and development programmes for transport-related infrastructure and services, key performance indicators are required. The data for developing these performance indicators must be collected on a sustainable basis by the various sectors for collation and analysis. Although most of the relevant basic data exist in many establishments, they are often scattered and are neither collated nor disseminated in any structured manner. The transportation sector is no exception. A recent study of the Ghana Road Sub-sector Programme finds an urgent need to reinforce the monitoring system of the MRT, as performance indicators have only partially been collected and used; the road condition mix is monitored on an annual basis while other basic performance indicators are lacking. A good monitoring system will help improve policy formulation within the sub-sector, while its absence may result in a major funding reduction, because the contribution to national development objectives, such as poverty alleviation, cannot be substantiated and demonstrated.

    Objectives of survey: The development objective of the TSPS-II, as defined in the Ghana Poverty Reduction Strategy (GPRS), is to sustain economic growth through the provision of safe, reliable, efficient and affordable services for all transport users. The focus of the transport sector under the GPRS is to provide access through better distribution of the transport network, with special emphasis on high-poverty areas, in order to reduce transport disparities between urban and rural communities. The household survey is a component of a bigger programme which will serve as a reliable and sustainable one-stop shop for all the data and performance indicators for the transport sector. The immediate objective of the sub-component is to improve the effectiveness of implementation of policies and development programmes for the transport sector, including related infrastructure and services. The direct aim of the sub-component will be the collection, processing, analysis, documentation and dissemination of transport-related data, which will be useful for:

    1. Transport planning and policy formulation;
    2. Impact assessment, monitoring and evaluation of policies and programmes;
    3. Measuring the contribution of the transport sector to the achievement of the MDGs;
    4. Impact assessment of the transport sector on poverty alleviation and the general standard of living;
    5. Comparisons of performance of the transport sector over time and between countries for the purpose of drawing lessons and giving an indication of where interventions are necessary;
    6. Provision of a comprehensive database for justification of programmes and projects under the Multi-Donor Budgetary Support (MDBS).

    Geographic coverage

    National level Region Level

    Analysis unit

    Household and Individual

    Universe

    The survey covered all household members (Usual residents)

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample was representative of all households in Ghana. To achieve the study objectives, the sample size chosen was based on the type of variables under consideration, the required precision of the survey estimates and available resources. Taking all of these into consideration, a sample size of 6,000 households was deemed sufficient to achieve the survey objectives. This was enough to yield reliable estimates of all the important survey variables as well as being manageable to control and minimize non-sampling errors.

    Stratification and Sample Selection Procedures The total list of the Enumeration Areas (EAs) from the demarcation for the 2010 Population and Housing Census formed the sampling frame for the Phase II of the Transport Indicators Survey. The sampling frame was stratified into urban/rural residence and the 10 administrative regions of the country for the selection of the sample. The sample was selected in two stages.

    The first stage selection involved the systematic selection of 400 EAs with probability proportional to size, the measure of size being the number of households in each EA. The second stage selection involved the systematic selection of 15 households from each EA. See Appendix A for more details on the sample design.
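    To make the first-stage mechanism concrete, the sketch below illustrates systematic selection with probability proportional to size (PPS). It is illustrative only: the EA identifiers and household counts are hypothetical, and this is not the GSS implementation.

    python3 - <<'EOF'
    # Illustrative sketch of systematic PPS selection: EAs are chosen with
    # probability proportional to their household counts.
    import random
    ea_sizes = {"EA001": 120, "EA002": 310, "EA003": 95, "EA004": 475}  # hypothetical frame
    n_select = 2                                   # number of EAs to draw
    total = sum(ea_sizes.values())
    step = total / n_select                        # sampling interval
    start = random.uniform(0, step)                # random start in [0, step)
    points = [start + i * step for i in range(n_select)]
    cum, chosen = 0, []
    for ea, size in ea_sizes.items():              # walk the cumulative size scale
        cum += size
        chosen += [ea for p in points if cum - size <= p < cum]
    print(chosen)                                  # larger EAs are more likely to be hit
    EOF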

    Sampling deviation

    No deviations

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The questionnaire had the following sections:

    Section A: a household roster which collected basic information on all households members and household characteristics to determine eligible household members

    Section B: an education section which was administered to household members aged 3 years and older on the use of transport services to school

    Section C: a health section that was used to collect information on all household members on access and the use of transport services to health facilities

    Section D: an economic activity section administered to household members 7 years and older to collect information on their economic activities and the use of transport services, together with a market access section administered to household members engaged in agricultural activities to collect information on access to transport services for the sale of farm produce

    Section E: a general transport services section administered to all household members on the access and use of various modes of transport.

    Section F: a general transport services section administered to all households on the access to and use of various modes of transport.

    Cleaning operations

    Control mechanisms were built into the data-capturing application: range checks and skip patterns were incorporated, and partial double entry was done in order to compare and correct errors. After data capture, secondary editing was done in the form of consistency checks. CSPro 4.1 was used to capture the data.

    Response rate

    National: (5996/6000)*100=99.93%

    By Regions: Western=99.8% Central= 100.0% Greater Accra= 100.0% Volta = 99.5% Eastern=100.0% Ashanti = 100.0% Brong Ahafo = 100.0% Northern = 100.0% Upper East = 100.0% Upper West = 100.0%

    Region          Households completed   Households expected   Response rate (%)
    Western         569                    570                   99.8
    Central         510                    510                   100.0
    Greater Accra   855                    855                   100.0
    Volta           567                    570                   99.5
    Eastern         705                    705                   100.0
    Ashanti         1,125                  1,125                 100.0
    Brong Ahafo     585                    585                   100.0
    Northern        615                    615                   100.0
    Upper East      285                    285                   100.0
    Upper West      180                    180                   100.0
    Total           5,996                  6,000                 99.9

    Causes of non-response, by region:

    Result of interview           Western   Volta   Total
    Refused                       1         0       1
    No household member at home   0         2       2
    Other                         0         1       1
    Total                         1         3       4

    Sampling error estimates

    Sampling errors were calculated but were not included in the report.

    Data appraisal

    No other forms of data appraisal were undertaken.

  18. The Nationwide Readmissions Database

    • datacatalog.library.wayne.edu
    Updated Jun 19, 2020
    Cite
    U.S. Agency for Healthcare Research and Quality (AHRQ) (2020). The Nationwide Readmissions Database [Dataset]. https://datacatalog.library.wayne.edu/dataset/the-nationwide-readmissions-database
    Explore at:
    Dataset updated
    Jun 19, 2020
    Dataset provided by
    U.S. Agency for Healthcare Research and Quality (AHRQ)
    Description

    The Nationwide Readmissions Database (NRD) is a unique and powerful database designed to support various types of analyses of national readmission rates for all patients, regardless of the expected payer for the hospital stay. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. Repeat stays may or may not be related; the criteria used to determine the relationship between hospital admissions are left to the analyst using the NRD. This database addresses a large gap in healthcare data: the lack of nationally representative information on hospital readmissions for all ages.

  19. Get access of 69 Million Professional's Email Database

    • datarade.ai
    .json, .csv
    Updated Sep 12, 2025
    Cite
    Bytescraper (2025). Get access of 69 Million Professional's Email Database [Dataset]. https://datarade.ai/data-products/get-access-of-69-million-professional-s-email-database-b2b-email-databases
    Explore at:
    .json, .csv (available download formats)
    Dataset updated
    Sep 12, 2025
    Dataset authored and provided by
    Bytescraper
    Area covered
    Germany, Switzerland, Japan, Spain, Canada, India, New Zealand, Italy, South Africa, United Kingdom
    Description

    Good data is crucial for any business or organization that wants to grow its network, because all relevant details about companies and users are stored in the database. Many companies have benefited from using our email database to extract their prospects' details.

    It is well known that LinkedIn gives you the opportunity to expand your business network: you can connect with your prospects, directly or through mutual connections, by searching on keywords related to their name, company, profile, address, and so on. As a leading data provider, however, we spare you that work. Our professionals' email database contains the necessary business information about your prospects, with several ways to access them (especially email addresses and phone numbers).

    With our service, you can reach over 69 million records in 200+ countries. Our database is well organized and keeps information easily accessible. Increase your sales with reliable LinkedIn data that connects you directly to your goal; we have worked hard to supply a quality, reliable, sustainable email database.

  20. mimic-iii-clinical-database-demo-1.4

    • kaggle.com
    zip
    Updated Apr 1, 2025
    Cite
    Montassar bellah (2025). mimic-iii-clinical-database-demo-1.4 [Dataset]. https://www.kaggle.com/datasets/montassarba/mimic-iii-clinical-database-demo-1-4
    Explore at:
    zip (11,100,065 bytes; available download format)
    Dataset updated
    Apr 1, 2025
    Authors
    Montassar bellah
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Abstract MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.

    Background In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.

    MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.

    The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.

    Methods The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.

    This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.

    Data Description MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.

    The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
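    As a minimal sketch of this convention, Python's standard csv module implements RFC 4180 quoting, so the example string above round-trips as expected:

    python3 - <<'EOF'
    # RFC 4180: embedded double quotes are doubled inside a quoted field
    import csv, io
    raw = '"she said ""the patient was notified at 6pm"""'
    print(next(csv.reader(io.StringIO(raw))))
    # -> ['she said "the patient was notified at 6pm"']
    EOF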

    Usage Notes The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.

    CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files; SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.
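    For example, the sqlite3 command-line shell can import the CSV files directly (a sketch assuming the sqlite3 shell is installed; the database filename is arbitrary, and ADMISSIONS.csv stands in for any of the demo tables):

    # import one demo CSV file into a local SQLite database; repeat per table
    # if the target table does not exist, .import creates it and uses the
    # CSV header row for the column names
    sqlite3 mimic3_demo.db ".mode csv" ".import ADMISSIONS.csv ADMISSIONS"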

    DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/

    Release Notes Release notes for the demo follow the release notes for the MIMIC-III database.

    Acknowledgements This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.

    Conflicts of Interest The authors declare no competing financial interests.

    References Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
