100+ datasets found

VegeNet - Image datasets and Codes
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 27, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jo Yen Tan; Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7254508
Dataset updated
Oct 27, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jo Yen Tan; Jo Yen Tan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

Image datasets:

vege_original : Images of vegetables captured manually in data acquisition stage

vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed

non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods

food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.

food_image_dataset_split : Image dataset (4) split into train and test sets

process : Images created when cropping (pre-processing step) to create dataset (2).
Amount of data created, consumed, and stored 2010-2023, with forecasts to...
statista.com
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
Explore at:
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
May 2024
Area covered
Worldwide
Description
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
a
02.1 Integrating Data in ArcGIS Pro
hub.arcgis.com
Updated Feb 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iowa Department of Transportation (2017). 02.1 Integrating Data in ArcGIS Pro [Dataset]. https://hub.arcgis.com/documents/cd5acdcc91324ea383262de3ecec17d0
Explore at:
Dataset updated
Feb 15, 2017
Dataset authored and provided by
Iowa Department of Transportation
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
You have been assigned a new project, which you have researched, and you have identified the data that you need.The next step is to gather, organize, and potentially create the data that you need for your project analysis.In this course, you will learn how to gather and organize data using ArcGIS Pro. You will also create a file geodatabase where you will store the data that you import and create.After completing this course, you will be able to perform the following tasks:Create a geodatabase in ArcGIS Pro.Create feature classes in ArcGIS Pro by exporting and importing data.Create a new, empty feature class in ArcGIS Pro.
w
Synthetic Data for an Imaginary Country, Sample, 2023 - World
microdata.worldbank.org
Updated Jul 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
Explore at:
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World, World
Description
Abstract

The dataset is a relational dataset of 8,000 households households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observation were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
h
Generate-Distill-Data
huggingface.co
Updated Apr 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JHU Human Language Technology Center of Excellence (2025). Generate-Distill-Data [Dataset]. https://huggingface.co/datasets/hltcoe/Generate-Distill-Data
Explore at:
Dataset updated
Apr 28, 2025
Dataset authored and provided by
JHU Human Language Technology Center of Excellence
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
hltcoe/Generate-Distill-Data dataset hosted on Hugging Face and contributed by the HF Datasets community
r
R codes and dataset for Visualisation of Diachronic Constructional Change...
researchdata.edu.au
bridges.monash.edu
Updated Apr 1, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
Explore at:
Unique identifier
https://doi.org/10.26180/5c844c7a81768
Dataset updated
Apr 1, 2019
Dataset provided by
Monash University
Authors
Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Publication

Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

Description of R codes and data files in the repository

This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).

These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt.

Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
d
Footprints and producers of source data used to create southern portion of...
catalog.data.gov
data.usgs.gov
+2more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Footprints and producers of source data used to create southern portion of the high-resolution (1 m) San Francisco Bay, California, digital elevation model (DEM) [Dataset]. https://catalog.data.gov/dataset/footprints-and-producers-of-source-data-used-to-create-southern-portion-of-the-high-resolu
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
San Francisco Bay, California
Description
Polygon shapefile showing the footprint boundaries, source agency origins, and resolutions of compiled bathymetric digital elevation models (DEMs) used to construct a continuous, high-resolution DEM of the southern portion of San Francisco Bay.
C
Synthetic Integrated Services Data
data.wprdc.org
csv, html, pdf, zip
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
Explore at:
html, zip(39231637), csv(1375554033), pdfAvailable download formats
Dataset updated
Jun 25, 2024
Dataset provided by
Allegheny County
Description
Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.

How to create an Okta Account

data.nsw.gov.au
researchdata.edu.au

Updated May 29, 2025

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Spatial Services (DCS) (2025). How to create an Okta Account [Dataset]. https://data.nsw.gov.au/data/dataset/1-8584a80239754e66b39a1325f08f5b53

Explore at:

Dataset updated

May 29, 2025

Dataset provided by

Spatial Services (DCS)

Description

Access API

Metadata Portal Metadata Information

Content Title	How to create an Okta Account
Content Type	Document
Description	Documentation on how to create an Okta Account
Initial Publication Date	09/07/2024
Data Currency	09/07/2024
Data Update Frequency	Other
Content Source	Data provider files
File Type	Document
Attribution
Data Theme, Classification or Relationship to other Datasets
Accuracy
Spatial Reference System (dataset)	Other
Spatial Reference System (web service)	Other
WGS84 Equivalent To	Other
Spatial Extent
Content Lineage
Data Classification	Unclassified
Data Access Policy	Open
Data Quality
Terms and Conditions	Creative Commons
Standard and Specification
Data Custodian	Customer Hub
Point of Contact	Customer Hub
Data Aggregator
Data Distributor
Additional Supporting Information
TRIM Number

d
Data from: 10-m backscatter mosaic produced from backscatter intensity data...
catalog.data.gov
search.dataone.org
+3more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). 10-m backscatter mosaic produced from backscatter intensity data from sidescan sonar and multibeam datasets (BS_composite_10m.tif GeoTIFF Image; UTM, Zone 19N, WGS 84) [Dataset]. https://catalog.data.gov/dataset/10-m-backscatter-mosaic-produced-from-backscatter-intensity-data-from-sidescan-sonar-and-m
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
These data are qualitatively derived interpretive polygon shapefiles and selected source raster data defining surficial geology, sediment type and distribution, and physiographic zones of the sea floor from Nahant to Northern Cape Cod Bay. Much of the geophysical data used to create the interpretive layers were collected under a cooperative agreement among the Massachusetts Office of Coastal Zone Management (CZM), the U.S. Geological Survey (USGS), Coastal and Marine Geology Program, the National Oceanic and Atmospheric Administration (NOAA), and the U.S. Army Corps of Engineers (USACE). Initiated in 2003, the primary objective of this program is to develop regional geologic framework information for the management of coastal and marine resources. Accurate data and maps of seafloor geology are important first steps toward protecting fish habitat, delineating marine resources, and assessing environmental changes because of natural or human effects. The project is focused on the inshore waters of coastal Massachusetts. Data collected during the mapping cooperative involving the USGS have been released in a series of USGS Open-File Reports (http://woodshole.er.usgs.gov/project-pages/coastal_mass/html/current_map.html). The interpretations released in this study are for an area extending from the southern tip of Nahant to Northern Cape Cod Bay, Massachusetts. A combination of geophysical and sample data including high resolution bathymetry and lidar, acoustic-backscatter intensity, seismic-reflection profiles, bottom photographs, and sediment samples are used to create the data interpretations. Most of the nearshore geophysical and sample data (including the bottom photographs) were collected during several cruises between 2000 and 2008. More information about the cruises and the data collected can be found at the Geologic Mapping of the Seafloor Offshore of Massachusetts Web page: http://woodshole.er.usgs.gov/project-pages/coastal_mass/.
Small Business Contact Data | North American Small Business Owners |...
datarade.ai
Updated Oct 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Success.ai (2021). Small Business Contact Data | North American Small Business Owners | Verified Contact Details from 170M Profiles | Best Price Guaranteed [Dataset]. https://datarade.ai/data-products/small-business-contact-data-north-american-small-business-o-success-ai
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Oct 27, 2021
Dataset provided by
Area covered
Guatemala, Greenland, Saint Pierre and Miquelon, Costa Rica, Honduras, Belize, Panama, Bermuda, Mexico, United States of America
Description
Access B2B Contact Data for North American Small Business Owners with Success.ai—your go-to provider for verified, high-quality business datasets. This dataset is tailored for businesses, agencies, and professionals seeking direct access to decision-makers within the small business ecosystem across North America. With over 170 million professional profiles, it’s an unparalleled resource for powering your marketing, sales, and lead generation efforts.

Key Features of the Dataset:

Verified Contact Details

Includes accurate and up-to-date email addresses and phone numbers to ensure you reach your targets reliably.

AI-validated for 99% accuracy, eliminating errors and reducing wasted efforts.

Detailed Professional Insights

Comprehensive data points include job titles, skills, work experience, and education to enable precise segmentation and targeting.

Enriched with insights into decision-making roles, helping you connect directly with small business owners, CEOs, and other key stakeholders.

Business-Specific Information

Covers essential details such as industry, company size, location, and more, enabling you to tailor your campaigns effectively. Ideal for profiling and understanding the unique needs of small businesses.

Continuously Updated Data

Our dataset is maintained and updated regularly to ensure relevance and accuracy in fast-changing market conditions. New business contacts are added frequently, helping you stay ahead of the competition.

Why Choose Success.ai?

At Success.ai, we understand the critical importance of high-quality data for your business success. Here’s why our dataset stands out:

Tailored for Small Business Engagement Focused specifically on North American small business owners, this dataset is an invaluable resource for building relationships with SMEs (Small and Medium Enterprises). Whether you’re targeting startups, local businesses, or established small enterprises, our dataset has you covered.

Comprehensive Coverage Across North America Spanning the United States, Canada, and Mexico, our dataset ensures wide-reaching access to verified small business contacts in the region.

Categories Tailored to Your Needs Includes highly relevant categories such as Small Business Contact Data, CEO Contact Data, B2B Contact Data, and Email Address Data to match your marketing and sales strategies.

Customizable and Flexible Choose from a wide range of filtering options to create datasets that meet your exact specifications, including filtering by industry, company size, geographic location, and more.

Best Price Guaranteed We pride ourselves on offering the most competitive rates without compromising on quality. When you partner with Success.ai, you receive superior data at the best value.

Seamless Integration Delivered in formats that integrate effortlessly with your CRM, marketing automation, or sales platforms, so you can start acting on the data immediately.

Use Cases: This dataset empowers you to:

Drive Sales Growth: Build and refine your sales pipeline by connecting directly with decision-makers in small businesses. Optimize Marketing Campaigns: Launch highly targeted email and phone outreach campaigns with verified contact data. Expand Your Network: Leverage the dataset to build relationships with small business owners and other key figures within the B2B landscape. Improve Data Accuracy: Enhance your existing databases with verified, enriched contact information, reducing bounce rates and increasing ROI. Industries Served: Whether you're in B2B SaaS, digital marketing, consulting, or any field requiring accurate and targeted contact data, this dataset serves industries of all kinds. It is especially useful for professionals focused on:

Lead Generation Business Development Market Research Sales Outreach Customer Acquisition What’s Included in the Dataset: Each profile provides:

Full Name Verified Email Address Phone Number (where available) Job Title Company Name Industry Company Size Location Skills and Professional Experience Education Background With over 170 million profiles, you can tap into a wealth of opportunities to expand your reach and grow your business.

Why High-Quality Contact Data Matters: Accurate, verified contact data is the foundation of any successful B2B strategy. Reaching small business owners and decision-makers directly ensures your message lands where it matters most, reducing costs and improving the effectiveness of your campaigns. By choosing Success.ai, you ensure that every contact in your pipeline is a genuine opportunity.

Partner with Success.ai for Better Data, Better Results: Success.ai is committed to delivering premium-quality B2B data solutions at scale. With our small business owner dataset, you can unlock the potential of North America's dynamic small business market.

Get Started Today Request a sample or customize your dataset to fit your unique...
f
Data from: Database Creator for Mass Analysis of Peptides and Proteins,...
figshare.com
acs.figshare.com
txt
Updated Aug 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pandi Boomathi Pandeswari; Arnold Emerson Isaac; Varatharajan Sabareesh (2023). Database Creator for Mass Analysis of Peptides and Proteins, DC-MAPP: A Standalone Tool for Simplifying Manual Analysis of Mass Spectral Data to Identify Peptide/Protein Sequences [Dataset]. http://doi.org/10.1021/jasms.3c00030.s005
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1021/jasms.3c00030.s005
Dataset updated
Aug 1, 2023
Dataset provided by
ACS Publications
Authors
Pandi Boomathi Pandeswari; Arnold Emerson Isaac; Varatharajan Sabareesh
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Proteomic studies typically involve the use of different types of software for annotating experimental tandem mass spectrometric data (MS/MS) and thereby simplifying the process of peptide and protein identification. For such annotations, these softwares calculate the m/z values of the peptide/protein precursor and fragment ions, for which a database of protein sequences must be provided as an input file. The calculated m/z values are stored as another database, which the user usually cannot view. Database Creator for Mass Analysis of Peptides and Proteins (DC-MAPP) is a novel standalone software that can create custom databases for “viewing” the calculated m/z values of precursor and fragment ions, prior to the database search. It contains three modules. Peptide/Protein sequences as per user’s choice can be entered as input to the first module for creating a custom database. In the second module, m/z values must be queried-in, which are searched within the custom database to identify protein/peptide sequences. The third module is suited for peptide mass fingerprinting, which can be used to analyze both ESI and MALDI mass spectral data. The feature of “viewing” the custom database can be helpful not only for better understanding the search engine processes, but also for designing multiple reaction monitoring (MRM) methods. Post-translational modifications and protein isoforms can also be analyzed. Since, DC-MAPP relies on the protein/peptide “sequences” for creating custom databases, it may not be applicable for the searches involving spectral libraries. Python language was used for implementation, and the graphical user interface was built with Page/Tcl, making this tool more user-friendly. It is freely available at https://vit.ac.in/DC-MAPP/.
Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...
zenodo.org
explore.openaire.eu
bz2
Updated Mar 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.2592524
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

dump.tar.bz2

jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.

archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.

paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7 conda activate analyses

Go to the analyses folder and install all the dependencies of the requirements.txt

cd jupyter_reproducibility/analyses pip install -r requirements.txt

For reproducing the analyses, run jupyter on this folder:

jupyter notebook

Execute the notebooks on this order:

Index.ipynb

N0.Repository.ipynb

N1.Skip.Notebook.ipynb

N2.Notebook.ipynb

N3.Cell.ipynb

N4.Features.ipynb

N5.Modules.ipynb

N6.AST.ipynb

N7.Name.ipynb

N8.Execution.ipynb

N9.Cell.Execution.Order.ipynb

N10.Markdown.ipynb

N11.Repository.With.Notebook.Restriction.ipynb

N12.To.Paper.ipynb

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories export JUP_LOGS_DIR="/home/jupyter/logs"; # log files export JUP_COMPRESSION="lbzip2"; # compression program export JUP_VERBOSE="5"; # verbose level export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection export JUP_GITHUB_USERNAME="github_username"; # your github username export JUP_GITHUB_PASSWORD="github_password"; # your github password export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB) export JUP_FIRST_DATE="2013-01-01"; # initial date to query github export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address export JUP_EMAIL_TO="target@email.com"; # email that receives notifications export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank export JUP_WITH_EXECUTION="1"; # run execute python notebooks export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies export JUP_EXECUTION_MODE="-1"; # run following the execution order export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction # Frequenci of log report export JUP_ASTROID_FREQUENCY="5"; export JUP_IPYTHON_FREQUENCY="5"; export JUP_NOTEBOOKS_FREQUENCY="5"; export JUP_REQUIREMENT_FREQUENCY="5"; export JUP_CRAWLER_FREQUENCY="1"; export JUP_CLONE_FREQUENCY="1"; export JUP_COMPRESS_FREQUENCY="5"; export JUP_DB_IP="localhost"; # postgres database IP

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y conda activate raw27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y conda activate py27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y conda activate raw34 conda install jupyter -c conda-forge -y conda uninstall jupyter -y pip install --upgrade pip pip install jupyter pip install pipenv pip install -e jupyter_reproducibility/archaeology pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y conda activate py34 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y conda activate raw35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator conda activate py35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y conda activate raw36 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y conda activate py36 conda install -y anaconda-navigator jupyterlab_server navigator-updater pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.7

<code
130k Images (128x128) - Universal Image Embeddings
kaggle.com
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rohit singh (2022). 130k Images (128x128) - Universal Image Embeddings [Dataset]. https://www.kaggle.com/datasets/rhtsingh/google-universal-image-embeddings-128x128/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 29, 2022
Dataset provided by
Kaggle
Authors
Rohit singh
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Introduction This is my scraped, collected, and curated dataset for the Google Universal Image Embedding competition resized to 128x128. It contains 130k+ images in total and below provides a count for each class -

Data Count | apparel | artwork | cars | dishes | furniture | illustrations | landmark | meme | packaged | storefronts | toys | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 32,226| 4,957 | 8,144 | 5,831 | 10,488 | 3,347 | 33,063 | 3,301 | 23,382 | 5,387 | 2,402 |

Data Source 1. Apparel - Deep Fashion Dataset 2. Artwork - Google Scrapped 3. Cars - Stanford Cars Dataset 4. Dishes - Google Scrapped 5. Furniture - Google Scrapped 6. Illustrations - Google Scrapped 7. Landmark - Google Landmark Dataset 8. Meme - Google Scrapped 9. Packaged - Holosecta, Grozi 3.2k, Freiburg Groceries, SKU110K 10. Storefronts - Google Scrapped 11. Toys - Google Scrapped
Data Modeling Software Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2024). Data Modeling Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-modeling-software-market
Explore at:
pptx, csv, pdfAvailable download formats
Dataset updated
Oct 16, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Modeling Software Market Outlook

The global data modeling software market size was valued at approximately USD 2.5 billion in 2023 and is projected to reach around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% from 2024 to 2032. The market's robust growth can be attributed to the increasing adoption of data-driven decision-making processes across various industries, which necessitates advanced data modeling solutions to manage and analyze large volumes of data efficiently.

The proliferation of big data and the growing need for data governance are significant drivers for the data modeling software market. Organizations are increasingly recognizing the importance of structured and unstructured data in generating valuable insights. With data volumes exploding, data modeling software becomes essential for creating logical data models that represent business processes and information requirements accurately. This software is crucial for implementation in data warehouses, analytics, and business intelligence applications, further fueling market growth.

Technological advancements, particularly in artificial intelligence (AI) and machine learning (ML), are also propelling the data modeling software market forward. These technologies enable more sophisticated data models that can predict trends, optimize operations, and enhance decision-making processes. The integration of AI and ML with data modeling tools allows for automated data analysis, reducing the time and effort required for manual processes and improving the accuracy of the results. This technological synergy is a significant growth factor for the market.

The rise of cloud-based solutions is another critical factor contributing to the market's expansion. Cloud deployment offers numerous advantages, such as scalability, flexibility, and cost-effectiveness, making it an attractive option for businesses of all sizes. Cloud-based data modeling software allows for real-time collaboration and access to data from anywhere, enhancing productivity and efficiency. As more companies move their operations to the cloud, the demand for cloud-compatible data modeling solutions is expected to surge, driving market growth further.

In terms of regional outlook, North America currently holds the largest share of the data modeling software market. This dominance is due to the high concentration of technology-driven enterprises and a strong emphasis on data analytics and business intelligence in the region. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period. Rapid digital transformation, increased cloud adoption, and the rising importance of data analytics in emerging economies like China and India are key factors contributing to this growth. Europe, Latin America, and the Middle East & Africa also present significant opportunities, albeit at varying growth rates.

Component Analysis

In the data modeling software market, the component segment is divided into software and services. The software component is the most significant contributor to the market, driven by the increasing need for advanced data modeling tools that can handle complex data structures and provide accurate insights. Data modeling software includes various tools and platforms that facilitate the creation, management, and optimization of data models. These tools are essential for database design, data architecture, and other data management tasks, making them indispensable for organizations aiming to leverage their data assets effectively.

Within the software segment, there is a growing trend towards integrating AI and ML capabilities to enhance the functionality of data modeling tools. This integration allows for more sophisticated data analysis, automated model generation, and improved accuracy in predictions and insights. As a result, organizations can achieve better data governance, streamline operations, and make more informed decisions. The demand for such advanced software solutions is expected to rise, contributing significantly to the market's growth.

The services component, although smaller in comparison to the software segment, plays a crucial role in the data modeling software market. Services include consulting, implementation, training, and support, which are essential for the successful deployment and utilization of data modeling tools. Many organizations lack the in-house expertise to effectively implement and manage data modeling software, leading to increased demand for professional services.
AI Training Dataset Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
AI Training Dataset Market Outlook

The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.

One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.

Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.

The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.

As the demand for AI applications continues to grow, the role of Ai Data Resource Service becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging Ai Data Resource Service, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. The service acts as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.

Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.

Data Type Analysis

The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.

Image data is critical for computer vision application
Lending Club Loan Data Analysis - Deep Learning
kaggle.com
Updated Aug 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Deependra Verma (2023). Lending Club Loan Data Analysis - Deep Learning [Dataset]. https://www.kaggle.com/datasets/deependraverma13/lending-club-loan-data-analysis-deep-learning
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 9, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Deependra Verma
Description
DESCRIPTION

Create a model that predicts whether or not a loan will be default using the historical data.

Problem Statement:

For companies like Lending Club correctly predicting whether or not a loan will be a default is very important. In this project, using the historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later this dataset is highly imbalanced and includes a lot of features that make this problem more challenging.

Domain: Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model.

Content:

Dataset columns and definition:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Steps to perform:

Perform exploratory data analysis and feature engineering and then apply feature engineering. Follow up with a deep learning model to predict whether or not the loan will be default using the historical data.

Tasks:

Feature Transformation

Transform categorical values into numerical values (discrete)

Exploratory data analysis of different factors of the dataset.

Additional Feature Engineering

You will check the correlation between features and will drop those features which have a strong correlation

This will help reduce the number of features and will leave you with the most relevant features

Modeling

After applying EDA and feature engineering, you are now ready to build the predictive models

In this part, you will create a deep learning model using Keras with Tensorflow backend
Data Science Platform Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2024). Data Science Platform Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-science-platform-market
Explore at:
pdf, csv, pptxAvailable download formats
Dataset updated
Oct 16, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Science Platform Market Outlook

The global data science platform market size was valued at approximately USD 49.3 billion in 2023 and is projected to reach USD 174.4 billion by 2032, growing at a compound annual growth rate (CAGR) of 15.1% during the forecast period. This exponential growth can be attributed to the increasing demand for data-driven decision-making processes, the surge in big data technologies, and the need for more advanced analytics solutions across various industries.

One of the primary growth factors driving the data science platform market is the rapid digital transformation efforts undertaken by organizations globally. Companies are shifting towards data-centric business models to gain a competitive edge, improve operational efficiency, and enhance customer experiences. The proliferation of IoT devices and the subsequent explosion of data generated have further propelled the need for sophisticated data science platforms capable of analyzing vast datasets in real-time. This transformation is not only seen in large enterprises but also increasingly in small and medium enterprises (SMEs) that recognize the potential of data analytics in driving business growth.

Moreover, the advancements in artificial intelligence (AI) and machine learning (ML) technologies have significantly augmented the capabilities of data science platforms. These technologies enable the automation of complex data analysis processes, allowing for more accurate predictions and insights. As a result, sectors such as healthcare, finance, and retail are increasingly adopting data science solutions to leverage AI and ML for personalized services, fraud detection, and supply chain optimization. The integration of AI/ML into data science platforms is thus a critical factor contributing to market growth.

Another crucial factor is the growing regulatory and compliance requirements across various industries. Organizations are mandated to ensure data accuracy, security, and privacy, necessitating the adoption of robust data science platforms that can handle these aspects efficiently. The implementation of regulations such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States has compelled organizations to invest in advanced data management and analytics solutions. These regulatory frameworks are not only a challenge but also an opportunity for the data science platform market to innovate and provide compliant solutions.

Regionally, North America dominates the data science platform market due to the early adoption of advanced technologies, a strong presence of key market players, and significant investments in research and development. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. This growth can be attributed to the increasing digitalization initiatives, a growing number of tech startups, and the rising demand for analytics solutions in countries like China, India, and Japan. The competitive landscape and economic development in these regions are creating ample opportunities for market expansion.

Component Analysis

The data science platform market, segmented by components, includes platforms and services. The platform segment encompasses software and tools designed for data integration, preparation, and analysis, while the services segment covers professional and managed services that support the implementation and maintenance of these platforms. The platform component is crucial as it provides the backbone for data science operations, enabling data scientists to perform data wrangling, model building, and deployment efficiently. The increasing demand for customized solutions tailored to specific business needs is driving the growth of the platform segment. Additionally, with the rise of open-source platforms, organizations have more flexibility and control over their data science workflows, further propelling this segment.

On the other hand, the services segment is equally vital as it ensures that organizations can effectively deploy and utilize data science platforms. Professional services include consulting, training, and support, which help organizations in the seamless integration of data science solutions into their existing IT infrastructure. Managed services provide ongoing support and maintenance, ensuring data science platforms operate optimally. The rising complexity of data ecosystems and the shortage of skilled data scientists are factors contributing to the growth of the services segment, as organizations often rely on external expert
Data from: Create a Story Map
hubexamples-dcdev.hub.arcgis.com
Updated Jul 16, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ESRI R&D Center (2018). Create a Story Map [Dataset]. https://hubexamples-dcdev.hub.arcgis.com/datasets/create-a-story-map
Explore at:
Dataset updated
Jul 16, 2018
Dataset provided by
Esrihttp://esri.com/
Authors
ESRI R&D Center
Description
DO NOT DELETE OR MODIFY THIS ITEM. This item is managed by the ArcGIS Hub application. Create your own initiative by combining existing applications with a custom site.
U
Polygon shapefile of data sources used to create a composite multibeam...
data.usgs.gov
catalog.data.gov
Updated Aug 23, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Dartnell; James Conrad; Janet Watt; Jenna Hill (2021). Polygon shapefile of data sources used to create a composite multibeam bathymetry surface of the southern Cascadia Margin offshore Oregon and northern California [Dataset]. http://doi.org/10.5066/P9C5DBMR
Explore at:
Unique identifier
https://doi.org/10.5066/P9C5DBMR
Dataset updated
Aug 23, 2021
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Peter Dartnell; James Conrad; Janet Watt; Jenna Hill
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
1996 - 2019
Area covered
Northern California, Oregon, California
Description
This polygon shapefile describes the data sources used to create a composite 30-m resolution multibeam bathymetry surface of southern Cascadia Margin offshore Oregon and northern California.

Facebook

Twitter

Click to copy link

Link copied

Cite

Jo Yen Tan; Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508

VegeNet - Image datasets and Codes

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.7254508

Dataset updated

Oct 27, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Jo Yen Tan; Jo Yen Tan

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

Image datasets:

vege_original : Images of vegetables captured manually in data acquisition stage
vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
food_image_dataset_split : Image dataset (4) split into train and test sets
process : Images created when cropping (pre-processing step) to create dataset (2).

Clear search

Close search

Google apps

Main menu

VegeNet - Image datasets and Codes

Amount of data created, consumed, and stored 2010-2023, with forecasts to...

02.1 Integrating Data in ArcGIS Pro

Synthetic Data for an Imaginary Country, Sample, 2023 - World

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Generate-Distill-Data

R codes and dataset for Visualisation of Diachronic Constructional Change...

Footprints and producers of source data used to create southern portion of...

Synthetic Integrated Services Data

Motivation

Collection

Preprocessing

Recommended Uses

Known Limitations/Biases

Feedback

Further Documentation and Resources

How to create an Okta Account

Data from: 10-m backscatter mosaic produced from backscatter intensity data...

Small Business Contact Data | North American Small Business Owners |...

Data from: Database Creator for Mass Analysis of Peptides and Proteins,...

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

130k Images (128x128) - Universal Image Embeddings

Data Modeling Software Market Report | Global Forecast From 2025 To 2033

Data Modeling Software Market Outlook

Component Analysis

AI Training Dataset Market Report | Global Forecast From 2025 To 2033

AI Training Dataset Market Outlook

Data Type Analysis

Lending Club Loan Data Analysis - Deep Learning

Data Science Platform Market Report | Global Forecast From 2025 To 2033

Data Science Platform Market Outlook

Component Analysis

Data from: Create a Story Map

Polygon shapefile of data sources used to create a composite multibeam...

VegeNet - Image datasets and Codes