60 datasets found
  1. Data from: Regression with Empirical Variable Selection: Description of a...

    • plos.figshare.com
    txt
    Updated Jun 8, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Regression with Empirical Variable Selection: Description of a New Method and Application to Ecological Datasets [Dataset]. http://doi.org/10.1371/journal.pone.0034338
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created; the first containing the variable with most empirical support, the second containing the first variable and the next most-supported, and so on. The comparatively small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset – habitat and offspring quality in the great tit (Parus major) – the optimal REVS model explained more variance (higher R2), was more parsimonious (lower AIC), and had greater significance (lower P values), than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R2 values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive as even when there are several competing models, they share a set of “core” variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
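
    The procedure described above lends itself to a compact sketch. The following is a hypothetical re-implementation from the description, not the authors' code: "empirical support" is scored here as the summed Akaike weight of every all-subsets model containing a variable, which is one plausible reading of the method.

```python
import itertools
import math

def ols_rss(xs, y):
    """Fit OLS with an intercept via the normal equations; return the residual sum of squares."""
    n, p = len(y), len(xs) + 1
    X = [[1.0] + [col[i] for col in xs] for i in range(n)]
    # Normal equations A b = c with A = X'X and c = X'y.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]
    return sum((y[i] - sum(b[j] * X[i][j] for j in range(p))) ** 2 for i in range(n))

def aic(rss, n, k):
    """Gaussian-likelihood AIC for a model with k predictors plus an intercept."""
    return n * math.log(rss / n) + 2 * (k + 1)

def revs(predictors, y):
    """Rank variables by empirical support, then compare the nested model series by AIC."""
    names = list(predictors)
    n = len(y)
    aics = {}
    for r in range(1, len(names) + 1):
        for subset in itertools.combinations(names, r):
            aics[subset] = aic(ols_rss([predictors[v] for v in subset], y), n, len(subset))
    best_aic = min(aics.values())
    # Empirical support: summed Akaike weight of every subset containing the variable.
    support = {v: sum(math.exp(-0.5 * (a - best_aic)) for s, a in aics.items() if v in s)
               for v in names}
    order = sorted(names, key=lambda v: -support[v])
    # Model i of the series contains the i most-supported variables.
    series = [tuple(v for v in names if v in order[:i]) for i in range(1, len(names) + 1)]
    return order, min(series, key=lambda s: aics[s])
```

    The function returns the support ordering and the AIC-best member of the nested series, so at most n models (n = number of predictor variables) need post-hoc comparison.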

  2. All methods r with respect to certain subsets.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 21, 2020
    Cite
    Barnes, Jonathan E.; Martin, Kyle P.; Ytreberg, F. Marty; Gonzalez, Tawny R.; Patel, Jagdish Suresh (2020). All methods r with respect to certain subsets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000490448
    Explore at:
    Dataset updated
    Dec 21, 2020
    Authors
    Barnes, Jonathan E.; Martin, Kyle P.; Ytreberg, F. Marty; Gonzalez, Tawny R.; Patel, Jagdish Suresh
    Description

    All methods r with respect to certain subsets.

  3. SDSS Galaxy Subset

    • zenodo.org
    application/gzip
    Updated Sep 5, 2022
    Cite
    Nuno Ramos Carvalho; Nuno Ramos Carvalho (2022). SDSS Galaxy Subset [Dataset]. http://doi.org/10.5281/zenodo.6696565
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Sep 5, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nuno Ramos Carvalho; Nuno Ramos Carvalho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of this survey: 60247 objects classified as galaxies. It includes a CSV file with a collection of information, plus a set of files for each object, namely JPG image files, FITS data, and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.

    The dataset includes a CSV data file where each row is an object from the SDSS database, and with the following columns (note that some data may not be available for all objects):

    • objid: unique SDSS object identifier
    • mjd: MJD of observation
    • plate: plate identifier
    • tile: tile identifier
    • fiberid: fiber identifier
    • run: run number
    • rerun: rerun number
    • camcol: camera column
    • field: field number
    • ra: right ascension
    • dec: declination
    • class: spectroscopic class (only objects with GALAXY are included)
    • subclass: spectroscopic subclass
    • modelMag_u: better of DeV/Exp magnitude fit for band u
    • modelMag_g: better of DeV/Exp magnitude fit for band g
    • modelMag_r: better of DeV/Exp magnitude fit for band r
    • modelMag_i: better of DeV/Exp magnitude fit for band i
    • modelMag_z: better of DeV/Exp magnitude fit for band z
    • redshift: final redshift from SDSS data z
    • stellarmass: stellar mass extracted from the eBOSS Firefly catalog
    • w1mag: WISE W1 "standard" aperture magnitude
    • w2mag: WISE W2 "standard" aperture magnitude
    • w3mag: WISE W3 "standard" aperture magnitude
    • w4mag: WISE W4 "standard" aperture magnitude
    • gz2c_f: Galaxy Zoo 2 classification from Willett et al 2013
    • gz2c_s: simplified version of Galaxy Zoo 2 classification (labels set)

    Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find files named after the objid column from the CSV file, containing the corresponding data. The following directory tree is available:

    sdss-gs/
    ├── data.csv
    ├── fits
    ├── img
    ├── spectra
    └── ssel

    Each directory contains:

    • img: RGB images from the object in JPEG format, 150x150 pixels, generated using the SkyServer DR16 API
    • fits: FITS data subsets around the object across the u, g, r, i, z bands; cut is done using the ImageCutter library
    • spectra: full best fit spectra data from SDSS between 4000 and 9000 wavelengths
    • ssel: best fit spectra data from SDSS for specific selected intervals of wavelengths discussed by Sánchez Almeida 2010
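
    A small helper can then join each data.csv row to its per-object files. This is a hypothetical sketch: it relies only on the documented naming (files named after objid) and globs rather than assuming particular file extensions.

```python
import csv
from pathlib import Path

def object_files(root, objid):
    """Collect the files for one object across the four data directories.

    Files are named after the objid (per the dataset description); globbing
    avoids hard-coding file extensions.
    """
    root = Path(root)
    return {d: sorted((root / d).glob(f"{objid}*"))
            for d in ("img", "fits", "spectra", "ssel")}

def iter_objects(root):
    """Yield one dict per data.csv row, with the object's files attached."""
    root = Path(root)
    with open(root / "data.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            row["files"] = object_files(root, row["objid"])
            yield row
```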

    Changelog

    • v0.0.3 - Increase number of objects to ~80k.
    • v0.0.2 - Increase number of objects to ~60k.
    • v0.0.1 - Initial import.
  4. OpenWebText 2M Subset

    • kaggle.com
    Updated Mar 17, 2025
    Cite
    Nikhil R (2025). OpenWebText 2M Subset [Dataset]. https://www.kaggle.com/datasets/nikhilr612/openwebtext-2m-subset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nikhil R
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A subset of OpenWebText, an open-source recreation of OpenAI's internal WebText corpus. This subset contains ~2 million documents, mainly in English, scraped from the Web. Highly unstructured text data; not necessarily clean.

  5. Details of the 10 additional datasets (the top five datasets are on...

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Details of the 10 additional datasets (the top five datasets are on species-habitat interactions; the second five datasets are wider biological datasets). [Dataset]. http://doi.org/10.1371/journal.pone.0034338.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Details of the 10 additional datasets (the top five datasets are on species-habitat interactions; the second five datasets are wider biological datasets).

  6. Source Code - Characterizing Variability and Uncertainty for Parameter...

    • catalog.data.gov
    • s.cnmilf.com
    Updated May 1, 2025
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Source Code - Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models [Dataset]. https://catalog.data.gov/dataset/source-code-characterizing-variability-and-uncertainty-for-parameter-subset-selection-in-p
    Explore at:
    Dataset updated
    May 1, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Source Code for the manuscript "Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models" -- This R code generates the results presented in this manuscript; the zip folder contains PBPK model files (for chloroform and DCM) and corresponding scripts to compile the models, generate human equivalent doses, and run sensitivity analysis.

  7. Six subsets of all native vascular plant species of California used by...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Feb 21, 2017
    Cite
    Bruce G. Baldwin; Andrew H. Thornhill; William A. Freyman; David. D. Ackerly; Matthew M. Kling; Naia Morueta-Holme; Brent D. Mishler (2017). Six subsets of all native vascular plant species of California used by Baldwin et al. (2017), and the R script to use to extract the subsets from the master spatial file [Dataset]. http://doi.org/10.6078/D1G010
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 21, 2017
    Dataset provided by
    Dryad
    Authors
    Bruce G. Baldwin; Andrew H. Thornhill; William A. Freyman; David. D. Ackerly; Matthew M. Kling; Naia Morueta-Holme; Brent D. Mishler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 21, 2017
    Area covered
    California
    Description

    From: Baldwin, B.G., A.H. Thornhill, W.A. Freyman, D.D. Ackerly, M.M. Kling, N. Morueta-Holme, and B.D. Mishler. 2017. Species richness and endemism in the native flora of California. American Journal of Botany 104: 487–501. http://www.amjbot.org/content/104/3/487.full Use with the other Baldwin (2017) data sets linked below.

  8. Common Crawl Micro Subset English

    • kaggle.com
    zip
    Updated Apr 10, 2025
    Cite
    Nikhil R (2025). Common Crawl Micro Subset English [Dataset]. https://www.kaggle.com/datasets/nikhilr612/common-crawl-micro-subset-english
    Explore at:
    Available download formats: zip (5504236429 bytes)
    Dataset updated
    Apr 10, 2025
    Authors
    Nikhil R
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A subset of Common Crawl, extracted from the Colossal Clean Crawled Corpus (C4) dataset with the additional constraint that the extracted text safely encodes to ASCII. A Unigram tokenizer with a vocabulary of 12.228k tokens is provided, along with pre-tokenized data.
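
    The ASCII-safety constraint is straightforward to reproduce when deriving a similar subset; a minimal sketch (hypothetical helper, not the dataset's actual pipeline):

```python
def filter_ascii(docs):
    """Keep only documents whose text encodes to ASCII without loss.

    str.isascii() is equivalent to text.encode('ascii') succeeding.
    """
    return [d for d in docs if d.isascii()]
```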

  9. Data from: Effects of nutrient enrichment on freshwater macrophyte and...

    • zenodo.org
    Updated Dec 13, 2023
    + more versions
    Cite
    Floris K. Neijnens; Floris K. Neijnens; Hadassa Moreira; Hadassa Moreira; Melinda M.J. De Jonge; Melinda M.J. De Jonge; Bart B.H.P. Linssen; Mark A.J. Huijbregts; Mark A.J. Huijbregts; Gertjan W. Geerling; Gertjan W. Geerling; Aafke M. Schipper; Aafke M. Schipper; Bart B.H.P. Linssen (2023). Effects of nutrient enrichment on freshwater macrophyte and invertebrate abundance: A meta-analysis [Dataset]. http://doi.org/10.5281/zenodo.10372444
    Explore at:
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Floris K. Neijnens; Floris K. Neijnens; Hadassa Moreira; Hadassa Moreira; Melinda M.J. De Jonge; Melinda M.J. De Jonge; Bart B.H.P. Linssen; Mark A.J. Huijbregts; Mark A.J. Huijbregts; Gertjan W. Geerling; Gertjan W. Geerling; Aafke M. Schipper; Aafke M. Schipper; Bart B.H.P. Linssen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The zip-file contains the data and code accompanying the paper 'Effects of nutrient enrichment on freshwater macrophyte and invertebrate abundance: A meta-analysis'. Together, these files should allow for the replication of the results.

    The 'raw_data' folder contains the 'MA_database.csv' file, which contains the extracted data from all primary studies that are used in the analysis. Furthermore, this folder contains the file 'MA_database_description.txt', which gives a description of each data column in the database.

    The 'derived_data' folder contains the files that are produced by the R-scripts in this study and used for data analysis. The 'MA_database_processed.csv' and 'MA_database_processed.RData' files contain the converted raw database that is suitable for analysis. The 'DB_IA_subsets.RData' file contains the 'Individual Abundance' (IA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria. The 'DB_IA_VCV_matrices.RData' contains for all IA data subsets the variance-covariance (VCV) matrices. The 'DB_AM_subsets.RData' file contains the 'Total Abundance' (TA) and 'Mean Abundance' (MA) data subsets based on taxonomic group (invertebrates/macrophytes) and inclusion criteria.

    The 'output_data' folder contains folders with the output data for each data subset (i.e. for each metric, taxonomic group and set of inclusion criteria). For each data subset, the folder contains random effects selection results ('Results1_REsel_

    The 'scripts' folder contains all R-scripts that we used for this study. The 'PrepareData.R' script takes the database as input and adjusts the file so that it can be used for data analysis. The 'PrepareDataIA.R' and 'PrepareDataAM.R' scripts make subsets of the data and prepare the data for the meta-regression analysis and mixed-effects regression analysis, respectively. The regression analyses are performed in the 'SelectModelsIA.R' and 'SelectModelsAM.R' scripts to calculate the regression model results for the IA metric and MA/TA metrics, respectively. These scripts require the 'RandomAndFixedEffects.R' script, containing the random and fixed effects parameter combinations, as well as the 'Functions.R' script. The 'CreateMap.R' script creates a global map with the location of all studies included in the analysis (figure 1 in the paper). The 'CreateForestPlots.R' script creates plots showing the IA data distribution for both taxonomic groups (figure 2 in the paper). The 'CreateHeatMaps.R' script creates heat maps for all metrics and taxonomic groups (figure 3 in the paper, figures S11.1 and S11.2 in the appendix). The 'CalculateStatistics.R' script calculates the descriptive statistics that are reported throughout the paper, and creates the figures that describe the dataset characteristics (figures S3.1 to S3.5 in the appendix). The 'CreateFunnelPlots.R' script creates the funnel plots for both taxonomic groups (figures S6.1 and S6.2 in the appendix) and performs Egger's tests. The 'CreateControlGraphs.R' script creates graphs showing the dependency of the nutrient response to control concentrations for all metrics and taxonomic groups (figures S10.1 and S10.2 in the appendix).

    The 'figures' folder contains all figures that are included in this study.

  10. ImageNet Subsets

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2024). Dataset: ImageNet Subsets. https://doi.org/10.57702/oetogsha [Dataset]. https://service.tib.eu/ldmservice/dataset/imagenet-subsets
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    ImageNet Subsets

  11. Data release for solar-sensor angle analysis subset associated with the...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data release for solar-sensor angle analysis subset associated with the journal article "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States" [Dataset]. https://catalog.data.gov/dataset/data-release-for-solar-sensor-angle-analysis-subset-associated-with-the-journal-article-so
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Western United States
    Description

    This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE javascript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. 
    Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data and achieve the same results using newer packages or other software programs.
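
    The forward-scatter model, log(NDVI) ~ solar-to-sensor angle, is an ordinary least-squares fit on log-transformed NDVI. A minimal sketch with synthetic numbers (not the study's R code or its data):

```python
import math

def fit_log_linear(angle, ndvi):
    """Fit log(ndvi) = a + b * angle by ordinary least squares; return (a, b, r_squared)."""
    ly = [math.log(v) for v in ndvi]
    n = len(angle)
    mx, my = sum(angle) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in angle)
    sxy = sum((x - mx) * (y - my) for x, y in zip(angle, ly))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(angle, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return a, b, 1.0 - ss_res / ss_tot
```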

  12. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below keeps only Uber 'Black' car types:

    df_black<-subset(df, df$name == 'Black')

    To save this subset so it can be reloaded later, write the data frame to a CSV file on your computer:

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not familiar with your working directory, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  13. ECG Chagas Disease [Balanced]

    • kaggle.com
    zip
    Updated Feb 3, 2025
    Cite
    Matteo Fasulo (2025). ECG Chagas Disease [Balanced] [Dataset]. https://www.kaggle.com/datasets/matteofasuloo/code15-ecg-chagas-balanced/code
    Explore at:
    Available download formats: zip (741625662 bytes)
    Dataset updated
    Feb 3, 2025
    Authors
    Matteo Fasulo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This code is not mine. The dataset provided here is a balanced subset derived from the original dataset, and I do not claim ownership over the original data.

    The CODE dataset was collected by the Telehealth Network of Minas Gerais (TNMG) in the period between 2010 and 2016. TNMG is a public telehealth system assisting 811 out of the 853 municipalities in the state of Minas Gerais, Brazil.

    The CODE 15% dataset is obtained from stratified sampling from the CODE dataset. This subset of the CODE dataset is described in and used for assessing model performance:

    "Deep neural network estimated electrocardiographic-age as a mortality predictor"
    Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, Sandhi M Barreto, Wagner Meira Jr, Thomas B Schön, Antonio Luiz P Ribeiro. MedRXiv (2021) https://www.doi.org/10.1101/2021.02.19.21251232

    This dataset is a subset of the CODE 15% dataset obtained by random sampling from the negative class while maintaining all the observations of the positive class to create a balanced dataset without the need to focus on class imbalance.

    The code15_hdf5 folder contains the exams and labels for the entire CODE 15% dataset. The code15_wfdb folder contains the exam records file in .dat format.

    An additional file (signals_features.csv) is provided, containing handcrafted features from the ECG records (lead II) related to P, Q, R, S, and T waves. Features such as P wave duration, PR interval, PR segment, QRS duration, ST segment, and ST slope were computed by first extracting all the points using the neurokit2 Python library and then aggregating them for each record ID using descriptive statistics. Heart rate variability features were also included along with the P, Q, R, S, and T wave features.
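
    The aggregation step, collapsing per-beat measurements into one row of descriptive statistics per record, can be sketched as follows (hypothetical interval values in milliseconds; not the dataset's actual code):

```python
import statistics

def aggregate_intervals(per_beat_ms):
    """Collapse per-beat interval measurements (ms) for one record into
    descriptive statistics, i.e. one row of a features table."""
    return {
        "mean": statistics.fmean(per_beat_ms),
        "median": statistics.median(per_beat_ms),
        "std": statistics.stdev(per_beat_ms) if len(per_beat_ms) > 1 else 0.0,
        "min": min(per_beat_ms),
        "max": max(per_beat_ms),
    }
```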

    Link to the original dataset: https://doi.org/10.5281/zenodo.4916206

  14. IndicVoices-R-Tamil

    • kaggle.com
    zip
    Updated Apr 7, 2025
    Cite
    VishwaRamsundar007 (2025). IndicVoices-R-Tamil [Dataset]. https://www.kaggle.com/datasets/vishwaramsundar007/indicvoices-r-tamil/code
    Explore at:
    Available download formats: zip (28514620182 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    VishwaRamsundar007
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndicVoices-R (IV-R) is the largest multilingual Indian text-to-speech (TTS) dataset derived from an automatic speech recognition (ASR) dataset. It contains 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. This subset contains Tamil language audio and metadata prepared for TTS research and benchmarking.

    Key Features:

    • Speaker Diversity: Includes Tamil speakers across various demographics
    • High-Quality Samples: Audio restored from ASR-quality speech using HTDemucs, VoiceFixer, and DeepFilterNet3
    • Natural Conversational Speech: Most recordings are extempore
    • Metadata: Includes speaker info, pitch, SNR, C50, duration, etc.
    • File Format: Audio in .wav (48 kHz), plus text, verbatim, normalized formats

    License: CC-BY-4.0

    Acknowledgements: Supported by Digital India Bhashini, EkStep Foundation, and Nilekani Philanthropies. Enhanced using PARAM-Siddhi supercomputing resources by CDAC Pune and supported by AI4Bharat team.

  15. Sounder SIPS: Aqua AIRS Level-1B Calibration Subset: Summary, V2...

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Sounder SIPS: Aqua AIRS Level-1B Calibration Subset: Summary, V2 (SNDRAQIML1BCALSUBSUM) at GES DISC - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sounder-sips-aqua-airs-level-1b-calibration-subset-summary-v2-sndraqiml1bcalsubsum-at-ges--4f6e9
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Atmospheric Infrared Sounder (AIRS) is a grating spectrometer (R = 1200) aboard the second Earth Observing System (EOS) polar-orbiting platform, EOS Aqua. This is the AIRS/Aqua Level-1C calibration subset, including clear cases, special calibration sites, random nadir spots, and high clouds. Infrared temperature sounders generate a large amount of Level-1B spectral data. For example, the AIRS instrument with 2378 channels, its visible light component, and AMSU with 15 channels create 3x240 files each day, for a total of over 500 MB of data. The purpose of the Calibration Data Subsets is to extract key information from these data into a few daily files to: 1. Facilitate a quick evaluation of the absolute calibration of the instruments. 2. Facilitate an assessment of the instrument performance under clear, cloudy, and extreme hot and cold conditions. 3. Facilitate the evaluation of instrument trends and their significance relative to climate trends. 4. Facilitate the comparison of AIRS with CrIS using their equivalent data subsets. The output files are constructed from Level-1B or Level-1C IR and MW brightness or antenna temperatures. Each file contains selected observations taken from a nominal 24-hour period. The “summary” product includes a large set of cases of interest, including all identified spectra that match the selection criteria detailed below for clear cases, special cloud classes, etc. These amount to about 10% of all spectra, but for each selected case only brightness temperatures (BTs) for selected key channels are saved.

  16. Vegetation variables in the case study dataset.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Anne E. Goodenough; Adam G. Hart; Richard Stafford (2023). Vegetation variables in the case study dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0034338.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Anne E. Goodenough; Adam G. Hart; Richard Stafford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vegetation variables in the case study dataset.

  17. Distribution of predictor variables among different subsets within the...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Feb 19, 2013
    Cite
    Been, Jasper V.; Andriessen, Peter; van Dongen, Martien C. J. M.; van Gool, Christel J. A. W.; de Rooij, Jasmijn D. E.; Kramer, Boris W.; Kornelisse, René F.; Rours, G. Ingrid J. G.; de Krijger, Ronald R.; Zimmermann, Luc J. I.; Vanterpool, Sizzle F. (2013). Distribution of predictor variables among different subsets within the derivation and validation cohorts. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001659359
    Explore at:
    Dataset updated
    Feb 19, 2013
    Authors
    Been, Jasper V.; Andriessen, Peter; van Dongen, Martien C. J. M.; van Gool, Christel J. A. W.; de Rooij, Jasmijn D. E.; Kramer, Boris W.; Kornelisse, René F.; Rours, G. Ingrid J. G.; de Krijger, Ronald R.; Zimmermann, Luc J. I.; Vanterpool, Sizzle F.
    Description

    Abbreviations: HC = histological chorioamnionitis; HCF = histological chorioamnionitis with fetal involvement.

  18. OpenML R Bot Benchmark Data (final subset)

    • figshare.com
    application/gzip
    Updated May 18, 2018
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl (2018). OpenML R Bot Benchmark Data (final subset) [Dataset]. http://doi.org/10.6084/m9.figshare.5882230.v2
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    May 18, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a clean subset of the data created by the OpenML R Bot, which executed benchmark experiments on the binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total the data contain more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500,000 results per learner (except kknn, for which only 1,140 results are available). The csv file for each learner is a table with one row per benchmark experiment, containing: the OpenML data ID, hyperparameter values, performance measures (AUC, accuracy, Brier score), runtime, scimark (a runtime reference for the machine), and some meta-features of the dataset.

    OpenMLRandomBotResults.RData (format for R) contains all of the data in separate tables for the results, the hyperparameters, the meta-features, the runtime, the scimark results and reference results.
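    As a rough illustration of the per-learner table layout described above (the column names and values below are assumptions for the sketch, not the exact csv headers):

    ```r
    # Toy stand-in for one learner's csv; the real files have one row per benchmark run.
    results <- data.frame(
      data_id  = c(3, 3, 31),          # OpenML dataset ID
      cp       = c(0.01, 0.10, 0.05),  # an example rpart hyperparameter
      auc      = c(0.91, 0.88, 0.75),
      accuracy = c(0.85, 0.83, 0.71)
    )

    # Best AUC per dataset across random hyperparameter draws
    best_auc <- aggregate(auc ~ data_id, data = results, FUN = max)
    ```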

  19. WABI Subset: Police

    • researchdata.edu.au
    Updated Jul 29, 2016
    + more versions
    State Library of Western Australia (2016). WABI Subset: Police [Dataset]. https://researchdata.edu.au/wabi-subset-police/2994547
    Explore at:
    Dataset updated
    Jul 29, 2016
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    State Library of Western Australia
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Description

    This index was compiled by Miss Mollie Bentley from various records she has used relating to the police. These include: Almanac listings, Colonial Secretary's Office Records, Police Gazettes, various police department occurrence books and letter books, police journals, government gazettes, estimates, York police records etc.

    Entry is by name of policeman. Information given varies but is usually about appointments, promotions, retirements, transfers etc.

    The Western Australian Biographical Index (WABI) is a highly used resource at the State Library of Western Australia. A recent generous contribution by the Friends of Battye Library (FOBS) has enabled SLWA to have the original handwritten index cards scanned and later transcribed.

    The dataset contains several csv files describing card number, card text and a URL link to an image of the original handwritten card.

    The transcription was crowd-sourced and we are aware that there are some data quality issues, including:

    * Some cards are missing
    * Transcripts are crowdsourced so may contain spelling errors and possibly missing information
    * Some cards are crossed out; some of these are included in the collection and some are not
    * Some of the cards contain relevant information on the back (usually children of the person mentioned); this info should be on the next consecutive card
    * As the information is an index, collected in the 1970s from print material, it is incomplete. It is also unreferenced.

    It is still a very valuable dataset as it contains a wealth of information about early settlers in Western Australia. It is of particular interest to genealogists and historians.

  20. BELKA: supplementary calcs & data for ML

    • kaggle.com
    zip
    Updated Jul 9, 2024
    Antonina Dolgorukova (2024). BELKA: supplementary calcs & data for ML [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/belka-supplementary-calcs-and-data-for-ml
    Explore at:
    zip (159717156 bytes), available download formats
    Dataset updated
    Jul 9, 2024
    Authors
    Antonina Dolgorukova
    License

    https://cdla.io/sharing-1-0/

    Description

    All stored data and calculations are for the Leash Bio - Predict New Medicines with BELKA competition, and are based on its corresponding datasets.

    "Train subsets" folder (the subset will not be changed or reuploaded): The data from the parquet file with the train dataset, containing SMILES (Simplified Molecular-Input Line-Entry System) for ~295M molecules and the labels as binary binding classifications, one per protein target out of three targets, were divided into 3 subsets for each protein, so that each subset contains all molecules that bind to it and the same number of random molecules that do not bind to it.

    The subsets contain:
    HSA protein: 816,820 molecules
    sEH protein: 1,449,064 molecules
    BRD4 protein: 913,928 molecules
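
    The balanced construction described above can be sketched in R as follows; the column name `binds` and the toy data are assumptions for illustration, not the actual BELKA schema:

    ```r
    # Illustrative sketch of the balanced per-protein subsetting described above.
    set.seed(42)
    train <- data.frame(
      molecule = paste0("mol", 1:1000),
      binds    = rbinom(1000, 1, 0.1)  # 1 = binds to the target protein
    )

    binders     <- train[train$binds == 1, ]
    non_binders <- train[train$binds == 0, ]

    # Keep every binder and sample an equal number of random non-binders
    sampled_non <- non_binders[sample(nrow(non_binders), nrow(binders)), ]
    subset_balanced <- rbind(binders, sampled_non)
    ```

    Repeating this once per protein target would yield three subsets like those listed above.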

    The subsets are stored in the fast, R-readable qs format, which you can read as follows:

    library(qs)
    # qread() restores an object serialised with qsave()
    dt <- qread("/kaggle/input/belka-supplementary-calcs-and-data-for-ml/train subsets/BRD4_all_bind1_rand_bind0.qs")
    

    "Smiles" folder Contains all unique SMILES from all the three building blocks of the test and train sets in a scv (all_bb_smiles_by_bb.csv) and SDF format (all_bb_smiles.sdf). The conversion can be easily done with the ChemmineOB package and the OpenBabel software, but they unfortunately they are not available in kagge.

    library(ChemmineR)  # provides smiles2sdf() and write.SDF(); uses ChemmineOB/OpenBabel under the hood
    all_bb_sdfset <- smiles2sdf(named_vector_of_smiles)
    write.SDF(all_bb_sdfset, file = "all_bb_smiles_sdfset.sdf")
    

    "Features" folder Contains different features in csv format, genereted in this notebook

    "for sim analysis" folder Contains some precalculated data for this notebook investigating similarity between train and test molecules and a possible way to track generalizability of an ML model.

    bb features folder Different molecular representations of train and test molecules' building blocks
