There are five different files for this dataset:
1. A dataset listing the reported functional uses of chemicals (FUse)
2. All 729 ToxPrint descriptors obtained from ChemoTyper for chemicals in FUse
3. All EPI Suite properties obtained for chemicals in FUse
4. The confusion matrix values, similarity thresholds, and bioactivity index for each model
5. The functional use prediction, bioactivity index, and prediction classification (poor prediction, functional substitute, candidate alternative) for each Tox21 chemical
This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. Green Chemistry, Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074 (2017).
Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/FZK8WE
This database provides information on the amount of water used in agriculture and food systems, covering all sectors from farming to the food processing industries. The data are presented at the country level with sectoral disaggregation following the Nexus Social Accounting Matrix (SAM) sectoral specifications. The database also differentiates the type of water used in each sector by water source. Green water refers to water that originates from precipitation (rain), while blue water refers to all water that comes from irrigation, covering both surface water and groundwater. Both types of water are consumed by plants or animals during the production process. Grey water, on the other hand, is the amount of water generated by production activities that pollute it. Because it carries the pollutant loads created by production activities, this type of water can be seen as waste in the overall production system.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.
Data content areas include:
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was produced by scraping the encyclopedia on Dofus' website.
It was first done as a challenge and as a way to gather some textual data about the game. You can find the code used to scrape and parse the data here: https://github.com/Futurne/dofus_scrap.
The main files are the json files, where you will find all scraped items of the game from the encyclopedia.
Those files are named after their categories in the encyclopedia (e.g. you will find all weapons in the armes.json file).
You can explore those files; they are pretty self-explanatory.
Another dataset here is almanax.csv, which is simply a dataset of every almanax description, scraped over a whole year.
For each day, you will find the boss, rubrikabrax, and meryde descriptions.
You can use this dataset to fine-tune a pretrained French model by gathering all the textual information (in almanax.csv and in the json files).
The items have a description property that can be collected into a sizeable NLP corpus.
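If you want a quick way to pull all descriptions into one corpus, a minimal sketch could look like the following (it assumes each category json file is a list of item objects with a description field; adapt the keys to the actual schema of the scraped files):

```python
import json
from pathlib import Path

def collect_descriptions(json_dir: str) -> list[str]:
    """Gather the description text of every scraped item.

    Assumes each category file (e.g. armes.json) is a list of item dicts
    with an optional 'description' field -- adapt to the actual schema.
    """
    texts = []
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            items = json.load(f)
        for item in items:
            desc = item.get("description")
            if desc:
                texts.append(desc)
    return texts

corpus = collect_descriptions(".")  # directory containing the category json files
print(f"{len(corpus)} descriptions collected")
```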
You can also do some data analysis: can you find which items are overpowered? Are they harder to craft?
Finally, another idea would be to build an automatic gear ("stuff") optimizer. You could ask for the best Dofus equipment set that maximizes one element while satisfying some constraints.
Note that I ended up finding that there is already a non-official API (just like my dataset) that lets you get all of this data. You can check out their project here: https://dofapi.fr/. They do not seem to have updated it in a long time, though.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: In Brazil, studies that map electronic healthcare databases in order to assess their suitability for use in pharmacoepidemiologic research are lacking. We aimed to identify, catalogue, and characterize Brazilian data sources for Drug Utilization Research (DUR).
Methods: The present study is part of the project entitled "Publicly Available Data Sources for Drug Utilization Research in Latin American (LatAm) Countries." A network of Brazilian health experts was assembled to map secondary administrative data from healthcare organizations that might provide information related to medication use. A multi-phase approach including internet searches of institutional government websites, traditional bibliographic databases, and experts' input was used for mapping the data sources. The reviewers searched, screened, and selected the data sources independently; disagreements were resolved by consensus. Data sources were grouped into the following categories: 1) automated databases; 2) Electronic Medical Records (EMR); 3) national surveys or datasets; 4) adverse event reporting systems; and 5) others. Each data source was characterized by accessibility, geographic granularity, setting, type of data (aggregate or individual-level), and years of coverage. We also searched for publications related to each data source.
Results: A total of 62 data sources were identified and screened; 38 met the eligibility criteria for inclusion and were fully characterized. We grouped 23 (60%) as automated databases, four (11%) as adverse event reporting systems, four (11%) as EMRs, three (8%) as national surveys or datasets, and four (11%) as other types. Eighteen (47%) were classified as publicly and conveniently accessible online, providing information at the national level. Most of them offered more than 5 years of comprehensive data coverage and presented data at both the individual and aggregated levels. No information about population coverage was found. Drug coding is not uniform; each data source has its own coding system, depending on the purpose of the data. At least one scientific publication was found for each publicly available data source.
Conclusions: There are several types of data sources for DUR in Brazil, but a uniform system for drug classification and data quality evaluation does not exist. The extent of the population covered by year is unknown. Our comprehensive and structured inventory reveals a need for full characterization of these data sources.
The State Ambulatory Surgery Databases (SASD), State Inpatient Databases (SID), and State Emergency Department Databases (SEDD) are part of a family of databases and software tools developed for the Healthcare Cost and Utilization Project (HCUP).
HCUP's state-specific databases can be used to investigate state-specific and multi-state trends in health care utilization, access, charges, quality, and outcomes. PHS has several years (2008-2011) of the HCUP California datasets (SASD, SEDD, and SID) available.
The State Ambulatory Surgery and Services Databases (SASD) are State-specific files that include data for ambulatory surgery and other outpatient services from hospital-owned facilities. In addition, some States provide ambulatory surgery and outpatient services from nonhospital-owned facilities. The uniform format of the SASD helps facilitate cross-State comparisons. The SASD are well suited for research that requires complete enumeration of hospital-based ambulatory surgeries within geographic areas or States.
The State Inpatient Databases (SID) are State-specific files that contain all inpatient care records in participating states. Together, the SID encompass more than 95 percent of all U.S. hospital discharges. The uniform format of the SID helps facilitate cross-state comparisons. In addition, the SID are well suited for research that requires complete enumeration of hospitals and discharges within geographic areas or states.
The State Emergency Department Databases (SEDD) are a set of longitudinal State-specific emergency department (ED) databases included in the HCUP family. The SEDD capture discharge information on all emergency department visits that do not result in an admission. Information on patients seen in the emergency room and then admitted to the hospital is included in the State Inpatient Databases (SID).
SASD, SID, and SEDD each have **Documentation**, which includes:
All manuscripts (and other items you'd like to publish) must be submitted to
phsdatacore@stanford.edu for approval prior to journal submission.
We will check your cell sizes and citations.
For more information about how to cite PHS and PHS datasets, please visit:
https://phsdocs.developerhub.io/need-help/citing-phs-data-core
The HCUP California inpatient files were constructed from the confidential files received from the Office of Statewide Health Planning and Development (OSHPD). OSHPD excluded inpatient stays that, after processing by OSHPD, did not contain a complete and "in-range" admission date or discharge date. California also excluded inpatient stays that had an unknown or missing date of birth. OSHPD removes ICD-9-CM and ICD-10-CM diagnosis codes for HIV test results. Beginning with 2009 data, OSHPD changed regulations to require hospitals to report all external cause of injury diagnosis codes, including those specific to medical misadventures. Prior to 2009, OSHPD did not require collection of diagnosis codes identifying medical misadventures.
**Types of Facilities Included in the Files Provided to HCUP by the Partner**
California supplied discharge data for inpatient stays in general acute care hospitals, acute psychiatric hospitals, chemical dependency recovery hospitals, psychiatric health facilities, and state operated hospitals. A comparison of the number of hospitals included in the SID and the number of hospitals reported in the AHA Annual Survey is available starting in data year 2010. Hospitals do not always report data for a full calendar year. Some hospitals open or close during the year; other hospitals have technical problems that prevent them from reporting data for all months in a year.
**Inclusion of Stays in Special Units**
Included with the general acute care stays are stays in skilled nursing, intermediate care, rehabilitation, alcohol/chemical dependency treatment, and psychiatric units of hospitals in California. How the stays in these different types of units can be identified differs by data year. Beginning in 2006, the information is retained in the HCUP variable HOSPITALUNIT. Reliability of this indicator for the level of care depends on how it was assigned by the hospital. For data years 1998-2006, the information was retained in the HCUP variable LEVELCARE. Prior to 1998, the first
Midyear population estimates and projections for all countries and areas of the world with a population of 5,000 or more // Source: U.S. Census Bureau, Population Division, International Programs Center // Note: Total population available from 1950 to 2100 for 227 countries and areas. Other demographic variables available from base year to 2100. Base year varies by country and therefore data are not available for all years for all countries. See methodology: https://www.census.gov/programs-surveys/international-programs/about/idb.html
Terms of use: https://www.ibisworld.com/about/termsofuse/
With the phone book era far in the past, database and directory publishers have been forced to transform their business approach, focusing on their digital presence. Despite many publishers rapidly moving away from print services, they are experiencing immovable competition from online search engines and social media platforms within the digital space, negatively affecting revenue growth potential. Industry revenue has been eroding at a CAGR of 4.4% over the past five years and in 2024, a 3.9% drop has led to the industry revenue totaling $4.4 billion. Profit continues to drop in line with revenue, accounting for 4.7% of revenue as publishers invest more in their digital platforms. Interest in printed directories has disappeared as institutional clients and consumers have continued their shift to convenient online resources. Declining demand for print advertising has curbed revenue growth and online revenue has only slightly mitigated this downturn. Though many traditional publishers, such as Yellow Pages, now operate under parent companies with digital resources, directory publishers remain low on the list of options businesses have to choose from in digital advertising. Due to the convenience and connectivity that Facebook and Google services offer, traditional directory publishers have a limited ability to compete. Many providers have rebranded and tailored their services toward client needs, though these efforts have only had a marginal impact on revenue growth. The industry is forecast to decline at an accelerated CAGR of 5.2% over the next five years, reaching an estimated $3.4 billion in 2029, as businesses and consumers continually turn to digital alternatives for information and advertising opportunities. As AI and digital technology innovation expands, social media company products will likely improve at a faster rate than the digital offerings that directory publishers can provide. Though these companies will seek external partnerships to cut costs, they face an uphill battle to boost their visibility and reverse consumer habit trends.
The Habitat Use Database (HUD) was specifically designed to address the need for habitat-use analyses in support of the groundfish EFH, HAPCs, and fishing and nonfishing impacts components of the 2005 EFH EIS. HUD functionality and accessibility, and the ecological information upon which the HUD is based, will be improved in order for this database to fully support fisheries and ecosystem science and management. Upgrades to and applications of the HUD will be facilitated through a series of prioritized phases:
- Fully integrate the data entry, quality control, and reporting capabilities from the original HUD Access database with a web-based and programmatic interface. Improve the HUD software to accommodate the most current habitat maps and habitat classification codes. This will be achieved by NMFS in consultation with HUD architects at Oregon State University.
- Review and update the biological and ecological information in the HUD.
- Develop and apply improved models that will be used to create updated habitat suitability maps for all west coast groundfish species using the updated HUD and Pacific coast seafloor habitat maps.
- Integrate habitat suitability models with the online groundfish EFH data catalog (http://efh-catalog.coas.oregonstate.edu/overview/).
2005 habitat-use analysis supporting groundfish EFH.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 29, 2016. Database containing structural annotations for the proteomes of just under 100 organisms. Using data derived from public databases of translated genomic sequences, representatives from the major branches of Life are included: Prokaryota, Eukaryota and Archaea. The annotations stored in the database may be accessed in a number of ways. The help page provides information on how to access the database. 3D-GENOMICS is now part of a larger project, called e-Protein. The project brings together similar databases at three sites: Imperial College London, University College London and the European Bioinformatics Institute. e-Protein's mission statement is: "To provide a fully automated distributed pipeline for large-scale structural and functional annotation of all major proteomes via the use of cutting-edge computer GRID technologies." The following databases are incorporated: NRprot, SCOP, ASTRAL, PFAM, Prosite, taxonomy, COG. The following eukaryotic genomes are incorporated: Anopheles gambiae, protein sequences from the mosquito genome; Arabidopsis thaliana, protein sequences from the Arabidopsis genome; Caenorhabditis briggsae, protein sequences from the C. briggsae genome; Caenorhabditis elegans, protein sequences from the worm genome; Ciona intestinalis, protein sequences from the sea squirt genome; Danio rerio, protein sequences from the zebrafish genome; Drosophila melanogaster, protein sequences from the fruitfly genome; Encephalitozoon cuniculi, protein sequences from the E. cuniculi genome; Fugu rubripes, protein sequences from the pufferfish genome; Guillardia theta, protein sequences from the G. theta genome; Homo sapiens, protein sequences from the human genome; Mus musculus, protein sequences from the mouse genome; Neurospora crassa, protein sequences from the N. crassa genome; Oryza sativa, protein sequences from the rice genome; Plasmodium falciparum, protein sequences from the P. falciparum genome; Rattus norvegicus, protein sequences from the rat genome; Saccharomyces cerevisiae, protein sequences from the yeast genome; Schizosaccharomyces pombe, protein sequences from the fission yeast genome.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biased inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.
Additional Information: This work is an output of the VertLife project. To flag errors, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org, and your feedback is essential to improve data quality and support community use.
Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The error involves replacing white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an External Source here).
Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biased data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in TetrapodTraits, while additional tables record the 10K values per missing entry per species.
Taxonomy – includes 8 attributes that inform scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.
Phylogenetic tree – includes 2 attributes that notify which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.
Body size – includes 7 attributes that inform length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.
Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, nocturnal) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.
Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.
Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.
Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.
Threat status – includes 3 attributes that inform the assessed threat statuses according to IUCN red list and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.
RangeSize – the number of 110×110 grid cells covered by the species range map. Data derived from MOL.
Latitude – coordinate centroid of the species range map.
Longitude – coordinate centroid of the species range map.
Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.
Insularity – includes 2 attributes that notify if a species is insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.
AnnuMeanTemp – Average within-range annual mean temperature (Celsius degree). Data derived from CHELSA v. 1.2.
AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.
TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.
PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.
Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.
ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).
HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.
PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.
PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.
PropPastureArea – Proportion of species range map covered by pasture, defined as Grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.
File content
All files use UTF-8 encoding.
ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1) data. Given that the complete set of 10,000 tables necessitates nearly 17GB of storage space, we have organized sets of 1,000 tables into separate zip files to streamline the download process.
ImputedSets_1K.zip, imputations for trees 1 to 10.
ImputedSets_2K.zip, imputations for trees 11 to 20.
ImputedSets_3K.zip, imputations for trees 21 to 30.
ImputedSets_4K.zip, imputations for trees 31 to 40.
ImputedSets_5K.zip, imputations for trees 41 to 50.
ImputedSets_6K.zip, imputations for trees 51 to 60.
ImputedSets_7K.zip, imputations for trees 61 to 70.
ImputedSets_8K.zip, imputations for trees 71 to 80.
ImputedSets_9K.zip, imputations for trees 81 to 90.
ImputedSets_10K.zip, imputations for trees 91 to 100.
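As a minimal sketch of how one of these imputed tables could be inspected (the per-table file name below is only a placeholder; the BodyMass_g and ImputedMass columns follow the field descriptions above):

```python
import pandas as pd

# Placeholder path: one of the 10,000 imputed tables extracted from ImputedSets_1K.zip.
table_path = "ImputedSets/imputed_table_0001.csv"

df = pd.read_csv(table_path, encoding="utf-8")

# ImputedMass = 0 marks an observed body mass, 1 marks an imputed value
# (the same convention applies to ImputedLength, ImputedActTime, ImputedHabitat).
observed = df.loc[df["ImputedMass"] == 0, "BodyMass_g"]
imputed = df.loc[df["ImputedMass"] == 1, "BodyMass_g"]
print(f"observed: n={len(observed)}, median={observed.median():.1f} g")
print(f"imputed:  n={len(imputed)}, median={imputed.median():.1f} g")
```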
TetrapodTraits_1.0.0.csv – the complete TetrapodTraits database, with missing data entries in natural history traits (body length, body mass, activity time, and microhabitat) replaced by the average across the 10K imputed values obtained through phylogenetic multiple imputation. Please note that imputed microhabitat (attribute fields: Fos, Ter, Aqu, Arb, Aer) and imputed activity time (attribute fields: Diu, Noc) are continuous variables within the 0-1 range interval. At the user's
analyze the health and retirement study (hrs) with r
the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
this new github repository contains five scripts:
- 1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party
- import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk, then load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
- longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create two database-backed complex sample survey objects using a taylor-series linearization design, and perform a mountain of analysis examples with wave weights from two different points in the panel
- import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html), parse through the IF block at the bottom of the sas importation script, blank out a number of variables, then save the file as an R data file (.rda) for fast loading later
- replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create a database-backed complex sample survey object using a taylor-series linearization design, and exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document
click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs. notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
From Lisk et al. (in review): "In the arid and semi-arid western U.S., access to water is regulated through a legal system of water rights. Individuals, companies, organizations, municipalities, and tribal entities have documents that declare their water rights. State water regulatory agencies collate and maintain these records, which can be used in legal disputes over access to water. While these records are publicly available data in all western U.S. states, the data have not yet been readily available in digital form from all states. Furthermore, there are many differences in data format, terminology, and definitions between state water regulatory agencies. Here, we have collected water rights data from 11 western U.S. state agencies, harmonized terminology and use definitions, formatted them consistently, and tied them to a western U.S.-wide shapefile of water administrative boundaries. We demonstrate how these data enable consistent regional-scale western U.S. hydrologic and economic modeling."
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kraken2 Arthropod Reference Database v.1: a Kraken2 (v2.1.2) database containing all 2,593 reference assemblies for Arthropoda available on NCBI as of March 2023.
This database was built for and used in the analysis of shotgun sequencing data of bulk DNA from Malaise trap samples collected by the Insect Biome Atlas, in the context of the manuscript "Small Bugs, Big Data: Metagenomics for arthropod biodiversity monitoring" by authors: López Clinton Samantha, Iwaszkiewicz-Eggebrecht Ela, Miraldo Andreia, Goodsell Robert, Webster Mathew T, Ronquist Fredrik, van der Valk Tom (for submission to Ecology and Evolution).
For custom database building, Kraken2 requires all headers in reference assembly fasta files to be annotated with "kraken:taxid|XXX" at the end of each header, where "XXX" is the corresponding National Center for Biotechnology Information (NCBI) taxID of the species. The code used to add the taxID information to each fasta file header, and to update the accession2taxid.map file required by Kraken2 for database building, is available in this GitHub repository: https://github.com/SamanthaLop/Small_Bugs_Big_Data (also linked under "Related Materials" below).
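A minimal, hypothetical sketch of that header convention is shown below; the linked repository contains the pipeline actually used, and the file name and taxID here are placeholders:

```python
def add_kraken_taxid(in_fasta: str, out_fasta: str, taxid: int) -> None:
    """Append the 'kraken:taxid|XXX' tag to every header of one assembly fasta.

    Illustrative helper only; see the linked GitHub repository for the
    code actually used to build this database.
    """
    with open(in_fasta) as fin, open(out_fasta, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(f"{line.rstrip()}|kraken:taxid|{taxid}\n")
            else:
                fout.write(line)

# Placeholder file names and taxID:
# add_kraken_taxid("assembly_genomic.fna", "assembly_taxid.fna", 12345)
```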
Content
Below is a list of the files in this item (in addition to the README and MANIFEST files), and their description. The first three files (marked with a *) are required to run Kraken2 classifications using the database.
We also recommend using the Kraken2 option --memory-mapping, as it ensures the database is loaded once for all samples, instead of once for each individual sample, saving considerable time and resources.
For more information on using Kraken2, see the Kraken2 wiki manual (https://github.com/DerrickWood/kraken2/wiki/Manual).
This database was built by Samantha López Clinton (samantha.lopezclinton@nrm) and Tom van der Valk (tom.vandervalk@nrm.se).
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
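As a rough pandas equivalent of that query (assuming the train.csv layout documented below and a 'Treatment' value in the qtype column; the actual category labels may differ):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # columns: qtype, Question, Answer

# Roughly: SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
mask = (df["qtype"] == "Treatment") & df["Question"].str.contains("pain", case=False, na=False)
answers = df.loc[mask, "Answer"]
print(f"{len(answers)} matching answers")
```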
Once you have obtained new insights about healthcare from the answers provided in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications if that happens.
Finally, once you have made an impact with your use case(s), don't forget proper citation etiquette; give credit where credit is due!
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand the good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to pypi; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
The efficient development, maintenance and administration of transport infrastructure and services are critical to the socio-economic development of any country. Scarce government resources and support from donor funds are required to provide these essential services to all sectors for the economic development of the country and for attaining equity and the participation of the populace in the creation of wealth and reduction of poverty.
To ascertain the effectiveness of implementation of policies and development programmes for transport-related infrastructure and services, key performance indicators are required. The data for developing these performance indicators must be collected on a sustainable basis by the various sectors for collation and analysis. Although most of the relevant basic data exist in many establishments, these are often scattered and are neither collated nor disseminated in any structured manner. The transport sector is no exception. A recent study of the Ghana Road Sub-sector Programme finds that there is an urgent need to reinforce the monitoring system of MRT, as performance indicators have only partially been collected and used; the road condition mix is monitored on an annual basis while other basic performance indicators are lacking. A good monitoring system will help improve policy formulation within the sub-sector, while its absence may result in a major funding reduction because the contribution to national development objectives, such as poverty alleviation, cannot be substantiated and demonstrated.
Objectives of the survey
The development objective of the TSPS-II, as defined in the Ghana Poverty Reduction Strategy (GPRS), is to sustain economic growth through the provision of safe, reliable, efficient and affordable services for all transport users. The focus of the transport sector under the GPRS is to provide access through better distribution of the transport network, with special emphasis on high-poverty areas, in order to reduce transport disparities between urban and rural communities. The household survey is a component of a bigger programme which will serve as a reliable and sustainable one-stop shop for all the data and performance indicators for the transport sector. The immediate objective of the sub-component is to improve the effectiveness of implementation of policies and development programmes for the transport sector, including related infrastructure and services. The direct aim of the sub-component will be the collection, processing, analysis, documentation and dissemination of transport-related data, which will be useful for:
National level; Region level
Household and Individual
The survey covered all household members (Usual residents)
Sample survey data [ssd]
The sample was representative of all households in Ghana. To achieve the study objectives, the sample size chosen was based on the type of variables under consideration, the required precision of the survey estimates and available resources. Taking all of these into consideration, a sample size of 6,000 households was deemed sufficient to achieve the survey objectives. This was enough to yield reliable estimates of all the important survey variables as well as being manageable to control and minimize non-sampling errors.
Stratification and Sample Selection Procedures
The total list of the Enumeration Areas (EAs) from the demarcation for the 2010 Population and Housing Census formed the sampling frame for the Phase II of the Transport Indicators Survey. The sampling frame was stratified into urban/rural residence and the 10 administrative regions of the country for the selection of the sample. The sample was selected in two stages.
The first stage selection involved the systematic selection of 400 EAs with probability proportional to size, the measure of size being the number of households in each EA. The second stage selection involved the systematic selection of 15 households from each EA. See Appendix A for more details on the sample design.
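For readers unfamiliar with systematic selection with probability proportional to size (PPS), the sketch below illustrates the first-stage logic with made-up EA sizes; it is not the survey's actual selection code:

```python
import random

def systematic_pps(sizes, n_sample, seed=1):
    """Systematic selection with probability proportional to size (PPS).

    sizes: number of households per enumeration area (EA).
    Returns indices of the selected EAs.
    """
    rng = random.Random(seed)
    total = sum(sizes)
    interval = total / n_sample          # sampling interval on the cumulative size scale
    start = rng.uniform(0, interval)     # random start
    targets = [start + k * interval for k in range(n_sample)]

    selected, cumulative, i = [], 0, 0
    for t in targets:                    # targets are increasing, so one pass suffices
        while cumulative + sizes[i] <= t:
            cumulative += sizes[i]
            i += 1
        selected.append(i)
    return selected

# Illustration only: 1,000 EAs with made-up household counts, select 400
demo_rng = random.Random(0)
sizes = [demo_rng.randint(50, 400) for _ in range(1000)]
print(len(systematic_pps(sizes, 400)))  # 400
```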
No deviations
Face-to-face [f2f]
The questionnaire had the following sections:
Section A: a household roster which collected basic information on all households members and household characteristics to determine eligible household members
Section B: an education section which was administered to household members aged 3 years and older on the use of transport services to school
Section C: a health section that was used to collect information on all household members on access and the use of transport services to health facilities
Section D: an economic activity section administered to household members 7 years and older to collect information on their economic activities and the use of transport services, together with a market access section administered to household members engaged in agricultural activities to collect information on access to transport services for the sale of farm produce
Section E: a general transport services section administered to all household members on the access and use of various modes of transport.
Section F: a general transport services section administered to all households on the access to and use of various modes of transport.
Control mechanisms were built into the data capture application: range checks and skip patterns were incorporated into it. Partial double entry was done in order to compare and correct errors. After data capture, secondary editing was done in the form of consistency checks. CSPro 4.1 was used to capture the data.
National: (5996/6000)*100=99.93%
By Regions: Western=99.8% Central= 100.0% Greater Accra= 100.0% Volta = 99.5% Eastern=100.0% Ashanti = 100.0% Brong Ahafo = 100.0% Northern = 100.0% Upper East = 100.0% Upper West = 100.0%
| Region | Hhs completed | Hhs expected | Response rate (%) |
|:-------|--------------:|-------------:|------------------:|
| Western | 569 | 570 | 99.8 |
| Central | 510 | 510 | 100.0 |
| Greater Accra | 855 | 855 | 100.0 |
| Volta | 567 | 570 | 99.5 |
| Eastern | 705 | 705 | 100.0 |
| Ashanti | 1,125 | 1,125 | 100.0 |
| Brong Ahafo | 585 | 585 | 100.0 |
| Northern | 615 | 615 | 100.0 |
| Upper East | 285 | 285 | 100.0 |
| Upper West | 180 | 180 | 100.0 |
| Total | 5,996 | 6,000 | 99.9 |
Causes of non-response, by region:

| Result of interview | Western | Volta | Total |
|:--------------------|--------:|------:|------:|
| Refused | 1 | 0 | 1 |
| No household member at home | 0 | 2 | 2 |
| Other | 0 | 1 | 1 |
| Total | 1 | 3 | 4 |
Sampling errors were calculated but are not included in the report.
No other forms of data appraisal
The Nationwide Readmissions Database (NRD) is a unique and powerful database designed to support various types of analyses of national readmission rates for all patients, regardless of the expected payer for the hospital stay. The NRD includes discharges for patients with and without repeat hospital visits in a year and those who have died in the hospital. Repeat stays may or may not be related. The criteria to determine the relationship between hospital admissions are left to the analyst using the NRD. This database addresses a large gap in healthcare data: the lack of nationally representative information on hospital readmissions for all ages.
Good data are crucial for any business or organization looking to grow its network, because all relevant details about the company and its users are stored in the database. Many companies have benefited from using our email database to extract their prospects' details.
It is a well-known fact that LinkedIn gives you the opportunity to expand your business network. You can easily connect with your prospects, directly or through mutual connections, by using search keywords related to their name, company, profile, address, etc. However, as a leading data provider, we make that unnecessary: our professionals' email database contains all the necessary business information about your prospects, and there are several ways to access it (especially email addresses and phone numbers).
With our service, you can reach over 69 million records in 200+ countries. Our database is well organized and keeps information easily accessible, so you can put it to use right away. Easily increase your sales with reliable LinkedIn data that connects you directly to your goal; we have worked hard to supply quality, reliable, sustainable email databases.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Abstract
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.
Background
In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.
MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.
The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.
Methods
The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.
This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.
Data Description
MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.
The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
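Python's standard csv module follows this convention by default, so the example above round-trips correctly (a small illustration, not part of the dataset documentation):

```python
import csv
import io

# The example row from the text, exactly as it would appear inside a MIMIC-III CSV field
raw = '"she said ""the patient was notified at 6pm"""\n'

reader = csv.reader(io.StringIO(raw))  # default dialect uses RFC 4180-style double-quote escaping
print(next(reader)[0])
# -> she said "the patient was notified at 6pm"
```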
Usage Notes
The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.
CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.
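A minimal sketch of that approach with pandas and the built-in sqlite3 module (the table names follow the MIMIC-III schema, but the file names, output path, and the short list below are illustrative; very large tables may need chunked loading):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("mimic3_demo.db")

# Load a few demo CSVs into their own tables; extend the list to the remaining tables.
for name in ["ADMISSIONS", "PATIENTS", "ICUSTAYS"]:
    df = pd.read_csv(f"{name}.csv")
    df.to_sql(name, conn, if_exists="replace", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n FROM ADMISSIONS", conn))
conn.close()
```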
DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
Release Notes
Release notes for the demo follow the release notes for the MIMIC-III database.
Acknowledgements
This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.
Conflicts of Interest
The authors declare no competing financial interests.
References
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...