100+ datasets found
  1. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    .json, .xml, .csv, .xlsAvailable download formats
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    British Indian Ocean Territory, Bangladesh, Moldova (Republic of), Northern Mariana Islands, Canada, Isle of Man, Nepal, Taiwan, Andorra, Tunisia
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  2. Dataset for Exploring case-control samples with non-targeted analysis

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for Exploring case-control samples with non-targeted analysis [Dataset]. https://catalog.data.gov/dataset/dataset-for-exploring-case-control-samples-with-non-targeted-analysis
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    These data contain the results of GC-MS, LC-MS and immunochemistry analyses of mask sample extracts. The data include tentatively identified compounds through library searches and compound abundance. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The data can not be accessed. Format: The dataset contains the identification of compounds found in the mask samples as well as the abundance of those compounds for individuals who participated in the trial. This dataset is associated with the following publication: Pleil, J., M. Wallace, J. McCord, M. Madden, J. Sobus, and G. Ferguson. How do cancer-sniffing dogs sort biological samples? Exploring case-control samples with non-targeted LC-Orbitrap, GC-MS, and immunochemistry methods. Journal of Breath Research. Institute of Physics Publishing, Bristol, UK, 14(1): 016006, (2019).

  3. AI-generated USA DL — 5k images

    • kaggle.com
    zip
    Updated Oct 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    simon graves (2025). AI-generated USA DL — 5k images [Dataset]. https://www.kaggle.com/datasets/simongraves/synthetic-usa-driver-license
    Explore at:
    zip(1108799394 bytes)Available download formats
    Dataset updated
    Oct 19, 2025
    Authors
    simon graves
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Synthetic License Dataset - 5 000 passport photos

    Dataset features 5,000 high-quality, AI-generated images of U.S. driver licenses from multiple states, including California, Texas, and New York. Designed for OCR, data extraction, and identity verification model training, this synthetic dataset ensures realistic detail, balanced demographics, and secure handling of personally identifiable information. - Get the data

    Dataset characteristics:

    CharacteristicData
    DescriptionPrinted synthetic driver license images for training ML models in PII extraction
    Data typesImage
    TasksOCR, Computer Vision
    Total number of files5,000
    Number of files in a set96 (Angles - 3, Lighting - 4, Backgrounds - 4, Distances - 2)
    StatesCalifornia, Florida, New York, Pennsylvania, Texas
    Angles0°, 25°, 45°
    LightingNatural-daylight, Office-LED, Warm-indoor, Dim-light
    BackgroundsNeutral wall, Textured desk, Outdoor pavement, Docs-on-docs
    DistanceClose (80-90% frame), Medium (50-60% frame)
    LabelingMetadata (License ID, Sample ID, Class, Gender, Age Group, Angle, Distance, Category, Resolution, Camera, Light Condition, Background, Timestamp)
    GenderMale, Female

    Here's a sample dataset to check out. For full access, go here

    Dataset structure

    • USA_C - folder of images (California)
    • USA_F - folder of images (Florida)
    • USA_NY - folder of images (New York)
    • USA_P - folder of images (Pennsylvania)
    • USA_T - folder of images (Texas)
    • Driver license metadata.xlsx - file contains metadata for all individuals in the dataset.

    Similar Datasets:

    1. Synthetic Passports Dataset
    2. Synthetic USA Driver License Dataset
    3. Synthetic Printed Mexican Passports Dataset
  4. Developer Community and Code Datasets

    • datarade.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset authored and provided by
    Oxylabs
    Area covered
    El Salvador, Philippines, Tuvalu, Bahamas, Guyana, Saint Pierre and Miquelon, Marshall Islands, South Sudan, United Kingdom, Djibouti
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  5. g

    Sister Study / EnviroAtlas Community Sample

    • gimi9.com
    • catalog.data.gov
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sister Study / EnviroAtlas Community Sample [Dataset]. https://gimi9.com/dataset/data-gov_sister-study-enviroatlas-community-sample/
    Explore at:
    Description

    Sister Study is a prospective cohort of 50,884 U.S. women aged 35 to 74 years old conducted by the NIEHS. Eligible participants are women without a history of breast cancer but with at least one sister diagnosed with breast cancer at enrollment during 2003 - 2009. Datasets used in this research effort include health outcomes, lifestyle factors, socioeconomic factors, medication history, and built and natural environment factors. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Contact NIEHS Sister Study (https://sisterstudy.niehs.nih.gov/English/index1.htm) for data access. Format: Datasets are provided in SAS and/or CSV format.

  6. Data from: Clinical Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
    Explore at:
    zip(16220 bytes)Available download formats
    Dataset updated
    Oct 5, 2023
    Authors
    Mohamadreza Momeni
    Description

    The purest type of electronic clinical data which is obtained at the point of care at a medical facility, hospital, clinic or practice. Often referred to as the electronic medical record (EMR), the EMR is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

    Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network provides mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

    About Dataset:

    333 scholarly articles cite this dataset.

    Unique identifier: DOI

    Dataset updated: 2023

    Authors: Haoyang Mi

    In this dataset, we have two dataset:

    1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

    2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

    Feel free to put your thought and analysis in a notebook for this datasets. And you can create some interesting and valuable ML projects for this case. Thanks for your attention.

  7. Facebook Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data, Facebook Datasets [Dataset]. https://brightdata.com/products/datasets/facebook
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Access our extensive Facebook datasets that provide detailed information on public posts, pages, and user engagement. Gain insights into post performance, audience interactions, page details, and content trends with our ethically sourced data. Free samples are available for evaluation. Over 940M records available Price starts at $250/100K records Data formats are available in JSON, NDJSON, CSV, XLSX and Parquet. 100% ethical and compliant data collection Included datapoints:

    Post ID Post Content & URL Date Posted Hashtags Number of Comments Number of Shares Likes & Reaction Counts (by type) Video View Count Page Name & Category Page Followers & Likes Page Verification Status Page Website & Contact Info Is Sponsored Post Attachments (Images/Videos) External Link Data And much more

  8. r

    Gplates sample data: open access datasets to accompany GPlates software

    • researchdata.edu.au
    Updated May 3, 2012
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The University of Sydney (2012). Gplates sample data: open access datasets to accompany GPlates software [Dataset]. https://researchdata.edu.au/gplates-sample-data-gplates-software/11150
    Explore at:
    Dataset updated
    May 3, 2012
    Dataset provided by
    The University of Sydney
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Dataset funded by
    Australian Research Council
    Description

    GPlates sample data provides open access to GPlates-compatible data files that can be loaded seamlessly in GPlates. GPlates is a desktop software application for the interactive visualisation of plate-tectonics. GPlates is free software (open-source software), developed by EarthByte (part of AuScope) at the University of Sydney, The Division of Geological and Planetary Sciences at CalTech, and the Center for Geodynamics at the Norwegian Geological Survey. GPlates is licensed for distribution under the GNU General Public License (GPL), version 2. GPlates sample data files include feature data, rasters and time-dependent raster images, and global plate polygon files (plate models).

    Feature data are available in GPlates Markup Language (.gpml), .PLATES4 (.dat), ESRI Shapefile (.shp) and longditude and latitude with header record (.xy) formats. Files include:

    • EarthByte Global Rotation Model
    • EarthByte Global Plate ID Assignments
    • EarthByte Global Coastline File
    • EarthByte Global Present Day Isochron File
    • EarthByte Global Continent-Ocean Boundary File
    • EarthByte Global Spreading Ridge File

    GPlates-compatible present-day rasters and time-dependent raster images include:

    • Global Present Day Agegrid for use in raster reconstructions
    • Global Time-Dependent Oceanic Crustal Age Files (0-140 Ma)
    • Global Time-Dependent Oceanic Seafloor Spreading Rate Files (0-140 Ma)
    • Global Time-Dependent Dynamic Topography Files (0-70 Ma)
    • Global Age-Coded Seismic Tomography Files (0-200 Ma)

    Global Plate Polygon Files (plate models) include:

    • EarthByte Plate Model 2009 Present Day Static Plate Polygon Files. This file uses static polygons to represent the boundaries of present day tectonic plates and palaeo-plate boundaries. This file is compatible with other GIS software (including ArcGIS, PaleoGIS, Quantum GIS, GRASS GIS, SAGA GIS) and technical computing programs such as Matlab. For further information, please refer to the associated documentation.
    • Gurnis et. al. Dynamic Closed Plate Polygon Data Files. In this file, polygons are used to represent continuously closing plates from 140 Ma to the present. For further information, please refer to the associated publication.

    For further information, please refer to the GPlates Sample Data page on the EarthByte website and the GPlates website.

  9. Z

    Sample datasets used in [Gil et al 2017] for AAAI 2017

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adusumilli, Ravali (2020). Sample datasets used in [Gil et al 2017] for AAAI 2017 [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_180716
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Stanford School of Medicine
    Authors
    Adusumilli, Ravali
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file includes the pointer to the 42 patient ids and zip file names of the 84 genomic and proteomic datasets used for the paper "Gil, Y, Garijo, D, Ratnakar, V, Mayani, R, Adusumilli, A, Srivastava, R, Boyce, H, Mallick,P. Towards Continuous Scientific Data Analysis and Hypothesis Evolution", accepted in AAAI 2017.

    The datasets itself are not published due to their size and access conditions. They can be retrieved with the provided ids from TCGA (https://gdc-portal.nci.nih.gov/legacy-archive/search/f) and CPTAC (https://cptac-data-portal.georgetown.edu/cptac/s/S022) archives.

    These patient ids are a subset of the nearly 90 samples used in "Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513,382–387", in order to test the system described in the AAAI 2017 paper. More samples were not included in the analysis due to time constraints.

  10. Data from: Sizing the Problem of Improving Discovery and Access to...

    • figshare.com
    xlsx
    Updated Jan 19, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kevin Read (2016). Sizing the Problem of Improving Discovery and Access to NIH-funded Data: A preliminary study [Dataset]. http://doi.org/10.6084/m9.figshare.1285515.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Kevin Read
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To inform efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by National Institutes of Health (NIH)-funded researchers. Of particular interest is characterizing those datasets that are not deposited in a known data repository or registry, e.g., those for which a related journal article does not indicate that underlying data have been deposited in a known repository. Such “invisible” datasets comprise the “long tail” of biomedical data and pose significant practical challenges to ongoing efforts to improve discoverability of and access to biomedical research data. This study identified datasets used to support the NIH-funded research reported in articles published in 2011 and cited in PubMed® and deposited in PubMed Central® (PMC). After searching for all articles that acknowledged NIH support, we first identified articles that contained explicit mention of datasets being deposited in recognized repositories. Thirty members of the NIH staff then analyzed a random sample of the remaining articles to estimate how many and what types of datasets were used per article. Two reviewers independently examined each paper. Each dataset is titled Bigdata_randomsample_xxxx_xx. The xxxx refers to the set of articles the annotator looked at, while the xxidentifies the annotator that did the analysis. Within each dataset, the author has listed the number of datasets they identified within the articles that they looked at. For every dataset that was found, the annotators were asked to insert a new row into the spreadsheet, and then describe the dataset they found (e.g., type of data, subject of study, etc.). Each row in the spreadsheet was always prepended by the PubMed Identifier (PMID) where the dataset was found. Finally, the files 2013-08-07_Bigdatastudy_dataanalysis, Dataanalysis_ack_si_datasets, and Datasets additional random sample mention vs deposit 20150313 refer to the analysis that was performed based on each annotator's analysis of the publications they were assigned, and the data deposits identified from the analysis.

  11. Z

    #PraCegoVer dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Institute of Computing, University of Campinas
    Authors
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
    Description

    Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

    containing the images. The file dataset.json comprehends a list of json objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  12. Sample Transformer Dataset

    • kaggle.com
    zip
    Updated Jan 17, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Piyush Grover (2024). Sample Transformer Dataset [Dataset]. https://www.kaggle.com/datasets/piygro/sample-transformer-dataset
    Explore at:
    zip(428630229 bytes)Available download formats
    Dataset updated
    Jan 17, 2024
    Authors
    Piyush Grover
    Description

    Sample Clean dataset for the training of transformer model.

    *Note: Rename the .mp4 to .zip extension to use *

  13. w

    Amazon Web Services - Public Data Sets

    • data.wu.ac.at
    Updated Oct 10, 2013
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Global (2013). Amazon Web Services - Public Data Sets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NTYxNjkxNmYtNmZlNS00N2EwLWJkYTktZjFjZWJkNTM2MTNm
    Explore at:
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    About

    From website:

    Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

    Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

  14. 2

    APS; Personal Well-Being; Subjective Well-Being

    • beta.ukdataservice.ac.uk
    • datacatalogue.ukdataservice.ac.uk
    Updated May 11, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Office for National Statistics, Social Survey Division (2016). APS; Personal Well-Being; Subjective Well-Being [Dataset]. http://doi.org/10.5255/UKDA-SN-7961-1
    Explore at:
    Dataset updated
    May 11, 2016
    Dataset provided by
    UK Data Servicehttps://ukdataservice.ac.uk/
    Authors
    Office for National Statistics, Social Survey Division
    Time period covered
    Apr 1, 2011 - Mar 1, 2015
    Area covered
    United Kingdom
    Description

    The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS) (held at the UK Data Archive under GN 33246), all of its associated LFS boosts and the APS boost. Thus, the APS combines results from five different sources: the LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS), the Welsh Labour Force Survey (WLFS), the Scottish Labour Force Survey (SLFS) and the Annual Population Survey Boost Sample (APS(B) - however, this ceased to exist at the end of December 2005, so APS data from January 2006 onwards will contain all the above data apart from APS(B)). Users should note that the LLFS, WLFS, SLFS and APS(B) are not held separately at the UK Data Archive. For further detailed information about methodology, users should consult the Labour Force Survey User Guide, selected volumes of which have been included with the APS documentation for reference purposes (see 'Documentation' table below).

    The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples such as the WLFS and SLFS, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.

    APS Well-Being data
    Since April 2011, the APS has included questions about personal and subjective well-being. The responses to these questions have been made available as annual sub-sets to the APS Person level files. It is important to note that the size of the achieved sample of the well-being questions within the dataset is approximately 165,000 people. This reduction is due to the well-being questions being only asked of persons aged 16 and above, who gave a personal interview and proxy answers are not accepted. As a result some caution should be used when using analysis of responses to well-being questions at detailed geography areas and also in relation to any other variables where respondent numbers are relatively small. It is recommended that for lower level geography analysis that the variable UACNTY09 is used.

    As well as annual datasets, three-year pooled datasets are available. When combining multiple APS datasets together, it is important to account for the rotational design of the APS and ensure that no person appears more than once in the multiple year dataset. This is because the well-being datasets are not designed to be longitudinal e.g. they are not designed to track individuals over time/be used for longitudinal analysis. They are instead cross-sectional, and are designed to use a cross-section of the population to make inferences about the whole population. For this reason, the three-year dataset has been designed to include only a selection of the cases from the individual year APS datasets, chosen in such a way that no individuals are included more than once, and the cases included are approximately equally spread across the three years. Further information is available in the 'Documentation' section below.

    Secure Access APS Well-Being data
    Secure Access datasets for the APS Well-Being include additional variables not included in either the standard End User Licence (EUL) versions (see under GN 33357) or the Special Licence (SL) access versions (see under GN 33376). Extra variables that typically can be found in the Secure Access version but not in the EUL or SL versions relate to:

    • geography, including:
      • Postcodes
      • Census Area Statistics (CAS) Wards
      • Census Output Areas
      • Nomenclature of Units for Territorial Statistics (NUTS) level 2 and 3 areas
      • Lower and Middle Layer Super Output Areas
      • Travel to Work Areas
      • Unitary authority / Local Authority District of place of work (main job)
      • region of place of work for first and second jobs
    • qualifications, education and training including level of highest qualification, qualifications from Government schemes, qualifications related to work, qualifications from school, qualifications from university of college and qualifications gained from outside the UK
    • detailed ethnic group for Scottish respondents
    • detailed religious denomination for Northern Irish respondents
    • length health problem has limited activity
    • learning difficulty or learning disability
    • occupation in apprenticeship or second job
    • number of bedrooms
    • number of dependent children in household aged under 19
    Prospective users of the Secure Access version of the APS Well-Being will need to fulfil additional requirements, commencing with the completion of an extra application form to demonstrate to the data owners exactly why they need access to the extra, more detailed variables, in order to obtain permission to use that version. Secure Access data users must also complete face-to-face training and agree to the Secure Access User Agreement and Licence Compliance Policy (see 'Access' section below). Therefore, users are encouraged to download and inspect the EUL version of the data prior to ordering the Secure Access (or SL) version. Further details and links to all APS studies available from the UK Data Archive can be found via the APS Key Data series webpage.

    APS Well-Being Datasets: Information, July 2016
    From 2012-2015, the ONS published separate APS datasets aimed at providing initial estimates of subjective well-being, based on the Integrated Household Survey. In 2015 these were discontinued. A separate set of well-being variables and a corresponding weighting variable have been added to the April-March APS person datasets from A11M12 onwards. Users should no longer use the bespoke well-being datasets (SNs 6994, 6999, 7091, 7092, 7364, 7365, 7565, 7566 and 7961, but should now use the variables included on the April-March APS person datasets instead. Further information on the transition can be found on the Personal well-being in the UK: 2015 to 2016

    Documentation and coding frames
    The APS is compiled from variables present in the LFS. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation (e.g. coding frames for education, industrial and geographic variables, which are held in LFS User Guide Vol.5, Classifications), users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.

    May 2018 Update
    Due to a change in the Travel-to-Work Area coding structure from 2001 to 2011, the variable TTWA9D has been relabelled in the pooled data file for 2012-2015.

  15. RICO dataset

    • kaggle.com
    zip
    Updated Dec 1, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Onur Gunes (2021). RICO dataset [Dataset]. https://www.kaggle.com/datasets/onurgunes1993/rico-dataset
    Explore at:
    zip(6703669364 bytes)Available download formats
    Dataset updated
    Dec 1, 2021
    Authors
    Onur Gunes
    Description

    Context

    Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.

    Content

    Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.

    Acknowledgements

    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN https://interactionmining.org/rico

    Inspiration

    The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.

  16. Ex-R Study Urine Data

    • catalog.data.gov
    Updated Jan 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Ex-R Study Urine Data [Dataset]. https://catalog.data.gov/dataset/ex-r-study-urine-data
    Explore at:
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    Emory University (analyzed the urine samples for pyrethroid metabolites). This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Contact Researcher. Format: Pyrethroid metabolite concentration data for 50 adults over six-weeks. This dataset is associated with the following publication: Morgan , M., J. Sobus , D.B. Barr, C. Croghan , F. Chen , R. Walker, L. Alston, E. Andersen, and M. Clifton. Temporal variability of pyrethroid metabolite levels in bedtime, morning, and 24-hr urine samples for 50 adults in North Carolina. ENVIRONMENT INTERNATIONAL. Elsevier Science Ltd, New York, NY, USA, 144: 81-91, (2015).

  17. d

    (HS 4) Large Extent Spatial Datasets in Maryland

    • dataone.org
    • hydroshare.org
    • +1more
    Updated Apr 13, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Young-Don Choi (2024). (HS 4) Large Extent Spatial Datasets in Maryland [Dataset]. https://dataone.org/datasets/sha256%3Afd0ad49bc510d98fb88234393a7cd5baa8a1f304b34d94386306ca15659b904c
    Explore at:
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Area covered
    Description

    This HydroShare resource was created to share large extent spatial (LES) datasets in Maryland on GeoServer (https://geoserver.hydroshare.org/geoserver/web/wicket/bookmarkable/org.geoserver.web.demo.MapPreviewPage) and THREDDS (https://thredds.hydroshare.org/thredds/catalog/hydroshare/resources/catalog.html).

    Users can access the uploaded LES datasets on HydroShare-GeoServer and THREDDS using this HS resource id. This resource was created using HS 2.

    Then, through the RHESSys workflows, users can subset LES datasets using OWSLib and xarray.

  18. Adult English Voice Dataset Sample

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    longmaodata (2024). Adult English Voice Dataset Sample [Dataset]. https://www.kaggle.com/datasets/longmaodata/adult-english-voice-dataset-sample
    Explore at:
    zip(13499234 bytes)Available download formats
    Dataset updated
    Oct 15, 2024
    Authors
    longmaodata
    Description

    A complete 28.57GB dataset has been uploaded to Hugging Face. Click to jump to:https://huggingface.co/datasets/longmaodata/Adult-Voice

    🔔Due to the platform's upload size restrictions and the extensive nature of our numerous public datasets, we can only provide samples of the datasets here. If you need the full public dataset, please join our official group to access it;

    🔔It is entirely free!

    🔔This helps promote open-source development!

    Complete data size

    28.57GB

    Join the group

    🚀🚀🚀🚀https://t.me/+Y5kL2iHis9A0ZWI1

    ✅ Obtain a complete dataset

    ✅ Mutual communication within the industry

    ✅ Get more information and consultation

    ✅ Timely dataset update notifications

    Dataset Introduction

    TTS average voice library

    Version

    v1.0

    Release Date

    2024-10-15

    Data Description

    Age: 18-70 years old, balanced across all ages

    Gender: Balanced between males and females

    Accent: Mandarin accent

    Recording environment: Background environment is quiet, SNR>20 dB

    Number of participants: Around 200 people, 100 from southern China and 100 from northern China

    Distance from microphone: Less than 20cm

    Style: Normal speaking speed and volume, realistic and natural

    Equipment: Various types of mobile phones

    Content: Recording texts focus on education and entertainment topics, covering daily conversations in school, family settings, such as stories, movies and TV shows, role-playing, and everyday family dialogues, while balancing phonemes.

    Number of texts: There are 24678 non-repeated texts, with an average of 12 words per line

    Format: Audio files are saved in 16k.wav format (file name is unique code); Texts are saved in excel format

    Quantity: Total length is 111.8h, total number of lines is 67977

    Directory Structure

    root_directory/
    
    ├── audio/   
    
    │  ├── audio1.wav
    
    
  19. i

    Malware Analysis Datasets: Top-1000 PE Imports

    • ieee-dataport.org
    Updated Nov 8, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Angelo Oliveira (2019). Malware Analysis Datasets: Top-1000 PE Imports [Dataset]. https://ieee-dataport.org/open-access/malware-analysis-datasets-top-1000-pe-imports
    Explore at:
    Dataset updated
    Nov 8, 2019
    Authors
    Angelo Oliveira
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.

  20. Sample datasets and their access information.

    • plos.figshare.com
    xls
    Updated Nov 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Diksha Pandey; Onkara Perumal P. (2023). Sample datasets and their access information. [Dataset]. http://doi.org/10.1371/journal.pone.0293939.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Diksha Pandey; Onkara Perumal P.
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Enormous gene expression data generated through next-generation sequencing (NGS) technologies are accessible to the scientific community via public repositories. The data harboured in these repositories are foundational for data integrative studies enabling large-scale data analysis whose potential is yet to be fully realized. Prudent integration of individual gene expression data i.e. RNA-Seq datasets is remarkably challenging as it encompasses an assortment and series of data analysis steps that requires to be accomplished before arriving at meaningful insights on biological interrogations. These insights are at all times latent within the data and are not usually revealed from the modest individual data analysis owing to the limited number of biological samples in individual studies. Nevertheless, a sensibly designed meta-analysis of select individual studies would not only maximize the sample size of the analysis but also significantly improves the statistical power of analysis thereby revealing the latent insights. In the present study, a custom-built meta-analysis pipeline is presented for the integration of multiple datasets from different origins. As a case study, we have tested with the integration of two relevant datasets pertaining to diabetic vasculopathy retrieved from the open source domain. We report the meta-analysis ameliorated distinctive and latent gene regulators of diabetic vasculopathy and uncovered a total of 975 i.e. 930 up-regulated and 45 down-regulated gene signatures. Further investigation revealed a subset of 14 DEGs including CTLA4, CALR, G0S2, CALCR, OMA1, and DNAJC3 as latent i.e. novel as these signatures have not been reported earlier. Moreover, downstream investigations including enrichment analysis, and protein-protein interaction (PPI) network analysis of DEGs revealed durable disease association signifying their potential as novel transcriptomic biomarkers of diabetic vasculopathy. While the meta-analysis of individual whole transcriptomic datasets for diabetic vasculopathy is exclusive to our comprehension, however, the novel meta-analysis pipeline could very well be extended to study the mechanistic links of DEGs in other disease conditions.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Organization logo

Company Datasets for Business Profiling

Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
British Indian Ocean Territory, Bangladesh, Moldova (Republic of), Northern Mariana Islands, Canada, Isle of Man, Nepal, Taiwan, Andorra, Tunisia
Description

Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

  • Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

  • Company name;
  • Size;
  • Founding date;
  • Location;
  • Industry;
  • Revenue;
  • Employee count;
  • Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

  • Fresh and accurate data collected and parsed by our expert web scraping team.
  • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
  • A customized approach tailored to your specific business needs.
  • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

  • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
  • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
  • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
  • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

Search
Clear search
Close search
Google apps
Main menu