11 datasets found

The bAbI tasks data
kaggle.com
zip
Updated May 16, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
roblex nana (2020). The bAbI tasks data [Dataset]. https://www.kaggle.com/roblexnana/the-babi-tasks-for-nlp-qa-system
Explore at:
zip(17499991 bytes)Available download formats
Dataset updated
May 16, 2020
Authors
roblex nana
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
Context

This dataset presents the first set of 20 tasks for testing text understanding and reasoning in the bAbI project. The tasks are described in detail in the paper: Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin and Tomas Mikolov. Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks, arXiv:1502.05698.

Please also see the following slides: Antoine Bordes Artificial Tasks for Artificial Intelligence, ICLR keynote, 2015.

The aim is that each task tests a unique aspect of text and reasoning, and hence test different capabilities of learning models. More tasks are planned in the future to capture more aspects.

Content

Training Set Size: For each task, there are 1000 questions for training, and 1000 for testing. However, we emphasize that the goal is to use as little data as possible to do well on the tasks (i.e. if you can use less than 1000 that’s even better) — and without resorting to engineering task-specific tricks that will not generalize to other tasks, as they may not be of much use subsequently. Note that the aim during evaluation is to use the same learner across all tasks to evaluate its skills and capabilities.

Supervision Signal: Further while the MemNN results in the paper use full supervision (including of the supporting facts) results with weak supervision would also be ultimately preferable as this kind of data is easier to collect. Hence results of that form are very welcome. E.g. this paper does include weakly supervised results.

For the reasons above there are currently several directories:

1) en/ — the tasks in English, readable by humans. 2) hn/ — the tasks in Hindi, readable by humans. 3) shuffled/ — the same tasks with shuffled letters so they are not readable by humans, and for existing parsers and taggers cannot be used in a straight-forward fashion to leverage extra resources– in this case the learner is more forced to rely on the given training data. This mimics a learner being first presented a language and having to learn from scratch. 4) en-10k/ shuffled-10k/ and hn-10k/ — the same tasks in the three formats, but with 10,000 training examples, rather than 1000 training examples. Note the results in the paper use 1000 training examples.

Versions: Some small updates since the original release have been made (see the README in the data download for more details). You can also get v1.0 and v1.1 here.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

The aim is to encourage the machine learning community to work on, and develop more, of these tasks.

References

https://research.fb.com/downloads/babi/
c
Oceanographic XBT Data Device Location for Joint USGS Cruise 03008 and NOAA...
s.cnmilf.com
data.usgs.gov
+2more
Updated Oct 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Oceanographic XBT Data Device Location for Joint USGS Cruise 03008 and NOAA RB0303 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/oceanographic-xbt-data-device-location-for-joint-usgs-cruise-03008-and-noaa-rb0303
Explore at:
Dataset updated
Oct 1, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
"The Expendable Bathythermograph (XBT) has been used by oceanographers for many years to obtain information on the temperature structure of the ocean to depths of up to 1500 meters. The XBT... is a probe which is dropped from a ship and measures the temperature as it falls through the water. Two very small wires transmit the temperature data to the ship where it is recorded for later analysis. The probe is designed to fall at a constant rate, so that the depth of the probe can be inferred from the time since it was launched. By plotting temperature as a function of depth, the [National Oceanic and Atmospheric Administration and U.S. Geological Survey] scientists can get a picture of the temperature profile of the water." (http://www.aoml.noaa.gov/phod/uot/uot_xbt.html). The XBT device and _location where it was dropped was engineered by the USGS Science Cruise 03008 in collaboration with NOAA Research Cruise RB0303 from 18 February to 7 March 2003, Leg II of III. (Leg I and III: 20020924 to 20020930 and 20030828 to 20030904, respectively). This data set is in shapefile format.
Cannabis - Consumer ,Producer dataset
kaggle.com
zip
Updated Nov 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jayanth.K.M (2020). Cannabis - Consumer ,Producer dataset [Dataset]. https://www.kaggle.com/benten867/cannabis-data
Explore at:
zip(290301 bytes)Available download formats
Dataset updated
Nov 8, 2020
Authors
Jayanth.K.M
Description
Context

Its been two years since the news that Canada has legalized weed hit us, so I was like why don't we get a dataset from Kaggle to practice a bit of data analysis and to my surprise I cannot find a weed dataset which reflects the economics behind legalized weed and how it has changed over time ,so I just went to the Canadian govt data site , and ola they have CSV files on exactly what I wanted floating around on their website and all I did was to download it straight up, and here I am to share it with the community.

Content

We have a series of CSV files each having data about things like supply, use case, production, etc but before we go into the individual files there are a few data columns which are common to all csv files

Ref_date: Reference period for the series being released

Dimension Name: Name of dimension. There can be up to 10 dimensions in a data table.

DGUID: Dissemination Geography Unique Identifier.

Unit of measure: The unit of measure applied to a member given in the text. There can be multiple units of measure in a table.

Unit of measure ID: The unique reference code associated with a particular unit of measure.

Scalar factor: The scalar factor associated with a data series, displayed as text. There can be multiple scalar factors in a table.

Scalar_ID: The unique numeric reference code associated with a particular scalar factor.

Vector: Unique variable length reference code time-series identifier, consisting of the letter 'V', followed by up to 10 digits.

Coordinate: Concatenation of the member ID values for each dimension.

Status: Shows various states of a data value using symbols. These symbols are described in the symbol legend and notes contained in the metadata file. Some symbols accompany a data value while others replace a data value. i.e. – A, B, C, D, E, F,.., X, 0s

Symbol: Describes data points that are preliminary or revised, displayed using the symbols p and r. These symbols accompany a data.

Understanding metadata files:

Cube Title: The title of the table. The output files are unilingual and thus will contain either the English or French title.

Product Id (PID): The unique 8 digit product identifier for the table.

CANSIM Id: The ID number which formally identified the table in CANSIM. (where applicable)

URL: The URL for the representative (default) view of a given data table.

Cube Notes: Each note is assigned a unique number. This field indicates which notes, if any, are applied to the entire table.

Archive Status: Describes the status of a table as either 'Current' or 'Archived'. Archived tables are those that are no longer updated.

Frequency: Frequency of the table. (i.e. annual)

Start Reference Period: The starting reference period for the table.

End Reference Period: The end reference period for the table.

Total Number of Dimensions: The total number of dimensions contained in the table.

Dimension Name: The name of a dimension in a table. There can be up to 10 dimensions in a table. (i.e. – Geography)

Dimension ID: The reference code assigned to a dimension in a table. A unique reference Dimension ID code is assigned to each dimension in a table.

Dimension Notes: Each note is assigned a unique number. This field indicates which notes are applied to a particular dimension.

Dimension Definitions: Reserved for future development.

Member Name: The textual description of the members in a dimension. (i.e. – Nova Scotia, Ontario (members of the Geography dimension))

Member ID: The code assigned to a member of a dimension. There is a unique ID for each member within a dimension. These IDs are used to create the coordinate field in the data file. (see the 'coordinate' field in the data record layout).

Classification (where applicable): Classification code for a member. Definitions, data sources and methods

Parent Member ID: The code used to display the hierarchical relationship between members in a dimension. (i.e. – The member Ontario (5) is a child of the member Canada (1) in the dimension 'Geography')

Terminated: Indicates whether a member has been terminated or not. Terminated members are those that are no longer updated.

Member Notes: Each note is assigned a unique number. This field indicates which notes are applied to each member.

Member definitions: Reserved for future development.

Symbol Legend: The symbol legend provides descriptions of the various symbols which can appear in a table. This field describes a comprehensive list of all possible symbols, regardless of whether a selected symbol appears in a particular table.

Survey Code: The unique code associated with a survey or program from which the data in the table is derived. Data displayed in one table may be derived ...
f
Data from: Estimation of Unit Process Data for Life Cycle Assessment Using a...
acs.figshare.com
xlsx
Updated Jun 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bu Zhao; Chenyang Shuai; Ping Hou; Shen Qu; Ming Xu (2023). Estimation of Unit Process Data for Life Cycle Assessment Using a Decision Tree-Based Approach [Dataset]. http://doi.org/10.1021/acs.est.0c07484.s001
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.est.0c07484.s001
Dataset updated
Jun 11, 2023
Dataset provided by
ACS Publications
Authors
Bu Zhao; Chenyang Shuai; Ping Hou; Shen Qu; Ming Xu
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Lacking unit process data is a major challenge for developing life cycle inventory (LCI) in life cycle assessment (LCA). Previously, we developed a similarity-based approach to estimate missing unit process data, which works only when less than 5% of the data are missing in a unit process. In this study, we developed a more flexible machine learning model to estimate missing unit process data as a complement to our previous method. In particular, we adopted a decision tree-based supervised learning approach to use an existing unit process dataset (ecoinvent 3.1) to characterize the relationship between the known information (predictors) and the missing one (response). The results show that our model can successfully classify the zero and nonzero flows with a very low misclassification rate (0.79% when 10% of the data are missing). For nonzero flows, the model can accurately estimate their values with an R2 over 0.7 when less than 20% of data are missing in one unit process. Our method can provide important data to complement primary LCI data for LCA studies and demonstrates the promising applications of machine learning techniques in LCA.
High resolution spectral data predicts taxonomic diversity in low diversity...
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Jul 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meghan Hayden; Elisa Van Cleemput; Katharine Suding; Ann Lezberg; Brian Anacker; Laura Dee (2024). High resolution spectral data predicts taxonomic diversity in low diversity grasslands [Dataset]. http://doi.org/10.5061/dryad.z08kprrp0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.z08kprrp0
Dataset updated
Jul 8, 2024
Dataset provided by
University of Colorado Boulder
City of Boulder Open Space & Mountain Parks
Leiden University
Authors
Meghan Hayden; Elisa Van Cleemput; Katharine Suding; Ann Lezberg; Brian Anacker; Laura Dee
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Mitigating impacts of global change on biodiversity is a pressing goal for land managers but understanding these impacts is often limited by the spatial and temporal constraints of traditional in-situ data. Advances in remote sensing address this challenge, in part, by enabling standardized mapping of biodiversity at large spatial scales and through time. In particular, hyperspectral imagery can detect functional and compositional characteristics of vegetation by measuring subtle differences in reflected light. The spectral variance hypothesis (SVH) expects spectral diversity, or variability in reflectance across pixels, to predict vegetation diversity. Especially when assessing herbaceous ecosystems, however, there is inconsistent evidence for the SVH, potentially due to a mismatch between plant size and the traditionally coarse pixels of satellite and airborne imagery or variation in the biological characteristics of the observed ecosystems, such as vegetation structure and composition, which can impact spectral variability. However, the majority of research testing the SVH to date has been conducted in systems with controlled conditions or spatially homogenous assemblages, with little generalizability to heterogeneous real-world systems. Here, we move the field forward by testing the SVH in a species-rich system with high heterogeneity resulting from variable species composition and a recent fire. We use very high spatial resolution (~1 mm) hyperspectral imagery to compare spectrally derived estimates of vegetation diversity with in-situ measures collected in Boulder, CO, USA. We find that spectral diversity and taxonomic diversity are positively correlated only for low to moderate diversity transects, or in transects that were recently burned where vegetation diversity is low and composed primarily of C3 grasses. Additionally, we find that the relationship between spectral and taxonomic diversity depends on spatial resolution, indicating that pixel size should remain a priority for biodiversity monitoring. The context dependency of this relationship, even with high spatial resolution data, confirms previous work that the SVH does not hold across landscapes and demonstrates the necessity for repeated, high-resolution data in order to tease apart the biological conditions underpinning the SVH. With refinement, however, the remote sensing techniques described here will offer land managers a cost-effective approach to monitor biodiversity across space and time. Methods Data collection for this project built on a long-term grassland monitoring initiative in the city of Boulder, CO, USA, run by their Open Space & Mountain Parks department. The monitoring area covers a 2,000 ha mixed grass prairie in the Front Range of Colorado, USA (39°56 N, 105°12 W). Permanent plots (hereafter, ‘transects’) of 50 m x 2 m were established in 2009 to implement the city of Boulder’s Grassland Ecosystem Management Plan (GEMAP). Transects were surveyed using a panel design for the first 8 years and thereafter, every three years for vegetation cover and composition. Cover is surveyed with a point intercept method: along the 50 m center line in a transect, species identity (or soil/rock/litter) is recorded at each meter mark at 0.5 m from both sides of the centerline, resulting in a total of 100 points. Additionally, vascular species that are present in the rest of the plot, but not hit during the point intercept method, are noted. Here, we utilize data from the 17 transects surveyed in 2022, which was collected at peak biomass for this system between July 18 and August 23. Vegetation survey data is summarized as relative vegetation cover, species richness and an abundance-based taxonomic diversity metric, Shannon Index, to protect the location and identity of species according to Boulder OSMP policy. To calculate vegetation cover, we computed the percentage of points along the transect where vegetation was hit by excluding rock, litter, and bare soil. To compute species richness, we calculated the number of species identified in each transect both as those counted with the point intercept method (here, “hits”) and as hits plus those identified as present in the rest of the plot (here, “full species list”). We also computed the Shannon diversity index of both the hits and the full species list. Species in the full species list that were not counted as hits were given an abundance of 0.1 for incorporation in this abundance-based metric. We also collected proximal imagery of the 17 GEMAP transects using a handheld Specim IQ hyperspectral camera (Spectral Imaging Ltd., Konica Minolta; www.specim.com). Sampling was conducted from August 11th - August 31st, 2022, such that the maximum time between vegetation survey and imagery was 24 days (average time between: 7 days). This hyperspectral camera is equipped with a built-in pushbroom imaging system (a line scan camera system composed of an imaging spectrograph, grayscale camera and objective) that records reflectance in 204 bands spanning the visible to near infrared regions (400 to 1000 nm; 7 nm spectral resolution FWHM) with a 1 mm spatial resolution when mounted on a tripod at 1.5 m from the ground. For each image, we included a white reference material provided by Specim (with a reflectance close to 100%) in the frame of reference. Sampling was conducted on cloud free days and within 3 hours from solar noon. Raw data was transformed to reflectance using the reflectance from the white reference, the dark reference signal, and the imaging integration time. Images were taken of square meter plots in 2-meter increments along the transect. We ultimately combined pixels from all images within a transect to represent the transect-level data in the manuscript, but here archive the spectra at the "plot" level. For each plot, there is a .dat file representing reflectance in 204 bands across a 512 pixels by 512 pixels. Subsequent processing, including removal of bad bands and filtering for photosynthetically active vegetation, is detailed in the manuscript.
a
Satellite (VIIRS) Thermal Hotspots and Fire Activity
digital-earth-pacificcore.hub.arcgis.com
portal30x30.com
+25more
Updated Apr 2, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esri (2020). Satellite (VIIRS) Thermal Hotspots and Fire Activity [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/datasets/esri2::satellite-viirs-thermal-hotspots-and-fire-activity
Explore at:
Dataset updated
Apr 2, 2020
Dataset authored and provided by
Esri
Area covered
Description
This layer presents detectable thermal activity from VIIRS satellites for the last 7 days. VIIRS Thermal Hotspots and Fire Activity is a product of NASA’s Land, Atmosphere Near real-time Capability for EOS (LANCE) Earth Observation Data, part of NASA's Earth Science Data. Consumption Best Practices:As a service that is subject to very high usage, ensure peak performance and accessibility of your maps and apps by avoiding the use of non-cacheable relative Date/Time field filters. To accommodate filtering events by Date/Time, we suggest using the included "Age" fields that maintain the number of days or hours since a record was created or last modified, compared to the last service update. These queries fully support the ability to cache a response, allowing common query results to be efficiently provided to users in a high demand service environment.When ingesting this service in your applications, avoid using POST requests whenever possible. These requests can compromise performance and scalability during periods of high usage because they too are not cacheable. Source: NASA LANCE - VNP14IMG_NRT active fire detection - WorldScale/Resolution: 375-meterUpdate Frequency: Hourly (depending on source availability) using the aggregated live feed methodologyArea Covered: WorldWhat can I do with this layer?This layer represents the most frequently updated and most detailed global remotely sensed wildfire information. Detection attributes include time, location, and intensity. It can be used to track the location of fires from the recent past, a few hours up to seven days behind real time. This layer also shows the location of wildfire over the past 7 days as a time-enabled service so that the progress of fires over that timeframe can be reproduced as an animation.The VIIRS thermal activity layer can be used to visualize and assess wildfires worldwide. However, it should be noted that this dataset contains many “false positives” (e.g., oil/natural gas wells or volcanoes) since the satellite will detect any large thermal signal.Fire points in this service are generally available within 3 1/4 hours after detection by a VIIRS device. LANCE estimates availability at around 3 hours after detection, and esri livefeeds check for updates every 20 minutes from LANCE.Even though these data display as point features, each point in fact represents a pixel that is >= 375 m high and wide. A point feature means somewhere in this pixel at least one "hot" spot was detected which may be a fire.VIIRS is a scanning radiometer device aboard the Suomi NPP, NOAA-20, and NOAA-21 satellites that collects imagery and radiometric measurements of the land, atmosphere, cryosphere, and oceans in several visible and infrared bands. The VIIRS Thermal Hotspots and Fire Activity layer is a livefeed from a subset of the overall VIIRS imagery, in particular from NASA's VNP14IMG_NRT active fire detection product. The source downloads are monitored automatically and retrieved from LANCE, NASA's near real time data and imagery site, every 20 minutes when updates detected.The 375-m data complements the 1-km Moderate Resolution Imaging Spectroradiometer (MODIS) Thermal Hotspots and Fire Activity layer; they both show good agreement in hotspot detection but the improved spatial resolution of the 375 m data provides a greater response over fires of relatively small areas and provides improved mapping of large fire perimeters.Attribute informationLatitude and Longitude: The center point location of the 375 m (approximately) pixel flagged as containing one or more fires/hotspots.Satellite: Whether the detection was picked up by the Suomi NPP satellite (N) or NOAA-20 satellite (1) or NOAA-21 satellite (2). For best results, use the virtual field WhichSatellite, redefined by an arcade expression, that gives the complete satellite name.Confidence: The detection confidence is a quality flag of the individual hotspot/active fire pixel. This value is based on a collection of intermediate algorithm quantities used in the detection process. It is intended to help users gauge the quality of individual hotspot/fire pixels. Confidence values are set to low, nominal and high. Low confidence daytime fire pixels are typically associated with areas of sun glint and lower relative temperature anomaly (<15K) in the mid-infrared channel I4. Nominal confidence pixels are those free of potential sun glint contamination during the day and marked by strong (>15K) temperature anomaly in either day or nighttime data. High confidence fire pixels are associated with day or nighttime saturated pixels.Please note: Low confidence nighttime pixels occur only over the geographic area extending from 11 deg E to 110 deg W and 7 deg N to 55 deg S. This area describes the region of influence of the South Atlantic Magnetic Anomaly which can cause spurious brightness temperatures in the mid-infrared channel I4 leading to potential false positive alarms. These have been removed from the NRT data distributed by FIRMS.FRP: Fire Radiative Power. Depicts the pixel-integrated fire radiative power in MW (MegaWatts). FRP provides information on the measured radiant heat output of detected fires. The amount of radiant heat energy liberated per unit time (the Fire Radiative Power) is thought to be related to the rate at which fuel is being consumed (Wooster et. al. (2005)).DayNight: D = Daytime fire, N = Nighttime fireHours Old: Derived field that provides age of record in hours between Acquisition date/time and latest update date/time. 0 = less than 1 hour ago, 1 = less than 2 hours ago, 2 = less than 3 hours ago, and so on. Additional information can be found on the NASA FIRMS site FAQ.Note about near real time data:Near real time data is not checked thoroughly before it's posted on LANCE or downloaded and posted to the Living Atlas. NASA's goal is to get vital fire information to its customers within three hours of observation time. However, the data is screened by a confidence algorithm which seeks to help users gauge the quality of individual hotspot/fire points. Low confidence daytime fire pixels are typically associated with areas of sun glint and lower relative temperature anomaly (<15K) in the mid-infrared channel I4. Medium confidence pixels are those free of potential sun glint contamination during the day and marked by strong (>15K) temperature anomaly in either day or nighttime data. High confidence fire pixels are associated with day or nighttime saturated pixels. RevisionsSeptember 10, 2025: Switched to alternate source site ‘firms2’ to get around data delivery delays on primary ‘firms’ site.March 7, 2024: Updated to include source data from NOAA-21 Satellite. September 15, 2022: Updated to include 'Hours_Old' field. Time series has been disabled by default, but still available. July 5, 2022: Terms of Use updated to Esri Master License Agreement, no longer stating that a subscription is required! This layer is provided for informational purposes and is not monitored 24/7 for accuracy and currency.If you would like to be alerted to potential issues or simply see when this Service will update next, please visit our Live Feed Status Page!
Kokoro Speech Dataset v1.1 Tiny
kaggle.com
zip
Updated May 14, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katsuya Iida (2021). Kokoro Speech Dataset v1.1 Tiny [Dataset]. https://www.kaggle.com/datasets/kaiida/kokoro-speech-dataset-v11-tiny
Explore at:
zip(48156884 bytes)Available download formats
Dataset updated
May 14, 2021
Authors
Katsuya Iida
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 34,958 short audio clips of a single speaker reading 9 novel books. The format of the metadata is similar to that of LJ Speech so that the dataset is compatible with modern speech synthesis systems.

The texts are from Aozora Bunko, which is in the public domain. The audio clips are from LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from kanji-kana mixture text. Readings are romanized which are similar to the format used by Julius.

The audio clips were split and transcripts were aligned automatically by Voice100.

Sample data

Listen from your browser or download randomly sampled 100 clips.

File Format

Metadata is provided in metadata.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:

ID: this is the name of the corresponding .wav file

Transcription: Kanji-kana mixture text spoken by the reader (UTF-8)

Reading: Romanized text spoken by the reader (UTF-8)

Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.

Statistics

The dataset is provided in different sizes, large, small, tiny. small and tiny don't share same clips. large contains all available clips, including small and tiny.

Large: Total clips: 34958 Min duration: 3.007 secs Max duration: 14.745 secs Mean duration: 4.978 secs Total duration: 48:20:24 Small: Total clips: 8812 Min duration: 3.007 secs Max duration: 14.431 secs Mean duration: 4.951 secs Total duration: 12:07:12 Tiny: Total clips: 285 Min duration: 3.019 secs Max duration: 9.462 secs Mean duration: 4.871 secs Total duration: 00:23:08

How to get the data

Because of its large data size of the dataset, audio files are not included in this repository, but the metadata is included.

To make .wav files of the dataset, run

$ bash download.sh

to download the metadata from the project page. Then run

$ pip3 install torchaudio $ python3 extract.py --size tiny

This prints a shell script example to download MP3 audio files from archive.org and extract them if you haven't done it already.

After doing so, run the command again

$ python3 extract.py --size tiny

to get files for tiny under ./output directory.

You can give another size name to the --size option to get dataset of the size.

Pretrained Tacotron model

Audio Samples

Pretrained model

Pretrained Tacotron model trained with Kokoro Speech Dataset and audio samples are available. The model was trained for 21K steps with small. According to the above repo, "Speech started to become intelligible around 20K steps" with LJ Speech Dataset. Audio samples read the first few sentences from Gon Gitsune which is not included in small.

Books

The dataset contains recordings from these books read by ekzemplaro

明暗 (Meian) 16:39:29 Online text

こころ (Kokoro) 08:46:41 Online text

田舎教師 (Inaka Kyoshi) 08:13:26 Online text

野分 (Nowaki) 4:40:49 Online text

草枕 (Kusamakura) 04:27:35 Online text

坊っちゃん (Botchan) 04:26:27 Online text

雁 (Gan) 03:41:31 Online text

ごん狐 (Gon gitsune) 0:15:42 Online text

[コーカサスの禿鷹 (Caucasus no Hagetaka)](https://l...
A
Image
data.amerigeoss.org
csv, esri rest +2
Updated Jul 5, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AmeriGEO ArcGIS (2017). Image [Dataset]. https://data.amerigeoss.org/it/dataset/image
Explore at:
geojson, html, esri rest, csvAvailable download formats
Dataset updated
Jul 5, 2017
Dataset provided by
AmeriGEO ArcGIS
Description
Map Information
This nowCOAST time-enabled map service provides maps of NOAA/National Weather Service RIDGE2 mosaics of base reflectivity images across the Continental United States (CONUS) as well as Puerto Rico, Hawaii, Guam and Alaska with a 2 kilometer (1.25 mile) horizontal resolution. The mosaics are compiled by combining regional base reflectivity radar data obtained from 158 Weather Surveillance Radar 1988 Doppler (WSR-88D) also known as NEXt-generation RADar (NEXRAD) sites across the country operated by the NWS and the Dept. of Defense and also from data from Terminal Doppler Weather Radars (TDWR) at major airports. The colors on the map represent the strength of the energy reflected back toward the radar. The reflected intensities (echoes) are measured in dBZ (decibels of z). The color scale is very similar to the one used by the NWS RIDGE2 map viewer. The radar data itself is updated by the NWS every 10 minutes during non-precipitation mode, but every 4-6 minutes during precipitation mode. To ensure nowCOAST is displaying the most recent data possible, the latest mosaics are downloaded every 5 minutes. For more detailed information about the update schedule, see: http://new.nowcoast.noaa.gov/help/#section=updateschedule
Background Information
Reflectivity is related to the power, or intensity, of the reflected radiation that is sensed by the radar antenna. Reflectivity is expressed on a logarithmic scale in units called dBZ. The "dB" in the dBz scale is logarithmic and is unit less, but is used only to express a ratio. The "z" is the ratio of the density of water drops (measured in millimeters, raised to the 6th power) in each cubic meter (mm^6/m^3). When the "z" is large (many drops in a cubic meter), the reflected power is large. A small "z" means little returned energy. In fact, "z" can be less than 1 mm^6/m^3 and since it is logarithmic, dBz values will become negative, as often in the case when the radar is in clear air mode and indicated by earth tone colors. dBZ values are related to the intensity of rainfall. The higher the dBZ, the stronger the rain rate. A value of 20 dBZ is typically the point at which light rain begins. The values of 60 to 65 dBZ is about the level where 3/4 inch hail can occur. However, a value of 60 to 65 dBZ does not mean that severe weather is occurring at that location. The best reflectivity is lowest (1/2 degree elevation angle) reflectivity scan from the radar. The source of the base reflectivity mosaics is the NWS Southern Region Radar Integrated Display with Geospatial Elements (RIDGE2).
Time Information
This map is time-enabled, meaning that each individual layer contains time-varying data and can be utilized by clients capable of making map requests that include a time component.

This particular service can be queried with or without the use of a time component. If the time parameter is specified in a request, the data or imagery most relevant to the provided time value, if any, will be returned. If the time parameter is not specified in a request, the latest data or imagery valid for the present system time will be returned to the client. If the time parameter is not specified and no data or imagery is available for the present time, no data will be returned.

In addition to ArcGIS Server REST access, time-enabled OGC WMS 1.3.0 access is also provided by this service.

Due to software limitations, the time extent of the service and map layers displayed below does not provide the most up-to-date start and end times of available data. Instead, users have three options for determining the latest time information about the service:

Issue a returnUpdates=true request for an individual layer or for the service itself, which will return the current start and end times of available data, in epoch time format (milliseconds since 00:00 January 1, 1970). To see an example, click on the "Return Updates" link at the bottom of this page under "Supported Operations". Refer to the ArcGIS REST API Map Service Documentation for more information.

Issue an Identify (ArcGIS REST) or GetFeatureInfo (WMS) request against the proper layer corresponding with the target dataset. For raster data, this would be the "Image Footprints with Time Attributes" layer in the same group as the target "Image" layer being displayed. For vector (point, line, or polygon) data, the target layer can be queried directly. In either case, the attributes returned for the matching raster(s) or vector feature(s) will include the following:

validtime: Valid timestamp.

starttime: Display start time.

endtime: Display end time.

reftime: Reference time (sometimes reffered to as issuance time, cycle time, or initialization time).

projmins: Number of minutes from reference time to valid time.

desigreftime: Designated reference time; used as a common reference time for all items when individual reference times do not match.

desigprojmins: Number of minutes from designated reference time to valid time.

Query the nowCOAST LayerInfo web service, which has been created to provide additional information about each data layer in a service, including a list of all available "time stops" (i.e. "valid times"), individual timestamps, or the valid time of a layer's latest available data (i.e. "Product Time"). For more information about the LayerInfo web service, including examples of various types of requests, refer to the nowCOAST help documentation at: http://new.nowcoast.noaa.gov/help/#section=layerinfo

References

NWS, 2003: NWS Product Description Document for Radar Integrated Display with Geospatial Elements Version 2- RIDGE2, NWS/SRH, Fort Worth, Texas (Available at http://products.weather.gov/PDD/RIDGE_II_PDD_ver2.pdf).

NWS, 2013: Radar Images for GIS Software (http://www.srh.noaa.gov/jetstream/doppler/gis.htm).
Scalable Unsupervised Learning for Unmanned Exploration
data.nasa.gov
application/rdfxml +5
Updated Jun 26, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2018). Scalable Unsupervised Learning for Unmanned Exploration [Dataset]. https://data.nasa.gov/dataset/Scalable-Unsupervised-Learning-for-Unmanned-Explor/e4bu-pstu
Explore at:
xml, json, csv, application/rdfxml, application/rssxml, tsvAvailable download formats
Dataset updated
Jun 26, 2018
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
Though we dream of the day when humans will first walk on Mars, these dreams remain in the distance. For now, we explore vicariously by sending robotic agents like the Curiosity rover in our stead. Though our current robotic systems are extremely capable, they lack perceptual common sense. This characteristic will be increasingly needed as we create robotic extensions of humanity to reach across the stars, for several reasons. First, robots can go places that humans cannot. If we manage to get a human on Mars by 2035, as predicted by the current NASA timeline, this will still represent a 60 year lag from the time of the first robotic lander. Second, while it is possible to replace common sense in robots with human teleoperated control to some extent, this becomes infeasible as the distance to the base planet and the associated radio signal delay increase. Finally, as we pack more and more sensors onboard, the fraction of data that can be sent back to earth decreases. Data triage (finding the few frames containing a curious object on a planet's surface out of terabytes of data) becomes more important.

In the last few years, research into a class of scalable unsupervised algorithms, also called deep learning algorithms, has blossomed, in part due to state of the art performance in a number of areas. A common thread among many recent deep learning algorithms is that they tend to represent the world in ways similar to how our brains represent the world. For example, thanks to decades of work by neuroscientists, we now know that in the V1 area of the visual cortex, the first region that visual information passes through after the retina, neurons tune themselves to respond to oriented edges and do so in a way that groups them together based on similarity. With this behavior as a goal, researchers set out to devise simple algorithms that reproduce this effect. It turns out that there are several. One, known as Topographic Independent Component Analysis, has each neuron start with random connections and then look for patterns that are statistically out of the ordinary. When it finds one, it locks onto this pattern, discouraging other neurons from duplicating its findings but simultaneously trying to group itself with other neurons that have learned patterns which are similar, but not identical.

My proposed research plan is to develop existing and new unsupervised learning algorithms of this type and apply them to a robotic system. Specifically, I will demonstrate a prototype system capable of (1) learning about itself and its environment and of (2) actively carrying out experiments to learn more about itself and its environment. Research will be kept focused by developing a system aimed at eventual deployment on an unmanned space mission. Key components of the project will include synthetic data experiments, experiments on data recorded from a real robot, and finally experiments with learning in the loop as the robot explores its environment and learns actively.

The unsupervised algorithms in question are applicable not only to a single domain, but to creating models for a wide range of applications. Thus, advances are likely to have far-reaching implications for many areas of autonomous space exploration. Tantalizing though this is, it is equally exciting that unsupervised learning is already finding application with surprisingly impressive performance right now, indicating great promise for near-term application to unmanned space exploration.
d
Intersects for Nantucket, Massachusetts, generated to calculate shoreline...
catalog.data.gov
s.cnmilf.com
Updated Nov 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Intersects for Nantucket, Massachusetts, generated to calculate shoreline change rates using the Digital Shoreline Analysis System version 5.0 [Dataset]. https://catalog.data.gov/dataset/intersects-for-nantucket-massachusetts-generated-to-calculate-shoreline-change-rates-using
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Nantucket, Massachusetts
Description
The Massachusetts Office of Coastal Zone Management launched the Shoreline Change Project in 1989 to identify erosion-prone areas of the coast. The shoreline position and change rate are used to inform management decisions regarding the erosion of coastal resources. In 2001, a shoreline from 1994 was added to calculate both long- and short-term shoreline change rates along ocean-facing sections of the Massachusetts coast. In 2013, two oceanfront shorelines for Massachusetts were added using 2008-9 color aerial orthoimagery and 2007 topographic lidar datasets obtained from the National Oceanic and Atmospheric Administration's Ocean Service, Coastal Services Center. This 2018 data release includes rates that incorporate two new mean high water (MHW) shorelines for the Massachusetts coast extracted from lidar data collected between 2010 and 2014. The first new shoreline for the State includes data from 2010 along the North Shore and South Coast from lidar data collected by the U.S. Army Corps of Engineers (USACE) Joint Airborne Lidar Bathymetry Technical Center of Expertise. Shorelines along the South Shore and Outer Cape are from 2011 lidar data collected by the U.S. Geological Survey's (USGS) National Geospatial Program Office. Shorelines along Nantucket and Martha’s Vineyard are from a 2012 USACE Post Sandy Topographic lidar survey. The second new shoreline for the North Shore, Boston, South Shore, Cape Cod Bay, Outer Cape, South Cape, Nantucket, Martha’s Vineyard, and the South Coast (around Buzzards Bay to the Rhode Island Border) is from 2013-14 lidar data collected by the (USGS) Coastal and Marine Geology Program. This 2018 update of the rate of shoreline change in Massachusetts includes two types of rates. Some of the rates include a proxy-datum bias correction, this is indicated in the filename with “PDB”. The rates that do not account for this correction have “NB” in their file names. The proxy-datum bias is applied because in some areas a proxy shoreline (like a High Water Line shoreline) has a bias when compared to a datum shoreline (like a Mean High Water shoreline). In areas where it exists, this bias should be accounted for when calculating rates using a mix of proxy and datum shorelines. This issue is explained further in Ruggiero and List (2009) and in the process steps of the metadata associated with the rates. This release includes both long-term (~150 years) and short term (~30 years) rates. Files associated with the long-term rates have “LT” in their names, files associated with short-term rates have “ST” in their names.
i07 PreSGMA GroundWaterManagementPlans
catalog.data.gov
data.cnra.ca.gov
+9more
Updated Jul 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of Water Resources (2025). i07 PreSGMA GroundWaterManagementPlans [Dataset]. https://catalog.data.gov/dataset/i07-presgma-groundwatermanagementplans-d3d5c
Explore at:
Dataset updated
Jul 24, 2025
Dataset provided by
California Department of Water Resourceshttp://www.water.ca.gov/
Description
This polygon feature class is a data set compiled by DWR employees in 2013 and represents the statewide Groundwater Management Plan (Plan) boundaries predating the Sustainable Groundwater Management Act (SGMA) requirements. Each polygon represents the area in which a Plan is to be implemented. The boundaries were provided to DWR by the affiliated public agency and compiled into a single statewide data set. Spatial plan boundaries were provided by agencies to DWR either via shapefiles or PDFs. PDFs were georeferenced and turned into GIS layers by DWR employees. This feature class is for legacy purposes only and will not be changed nor updated. It needs to be memorialized for spatial coverage of Groundwater Management Plans prior to SGMA and because SGMA only requires medium and high priority basins to have a Groundwater Sustainability Plan. The Plans outlined in this shapefile by medium and high priority basins are in effect until SGMA goes into effect. Some low and very low priority basins will likely use the existing plans to get funding for future basin management (since it is only voluntary for them to provide a Plan under SGMA, but they already have one in place). The data set is considered complete because of its legacy status. However, anyone using the data set will notice boundary inconsistencies, agency plan overlaps, mismatches, and other topology errors. The data set is based on boundary estimations and in the cases of medium and high priority basins will be outdated with in implementation of SGMA.The associated data are considered DWR enterprise GIS data, which meet all appropriate requirements of the DWR Spatial Data Standards, specifically the DWR Spatial Data Standard version 3.1, dated September 11, 2019. This data set was not produced by DWR. Data were originally developed and supplied by each individual plan agency and compiled by DWR. DWR makes no warranties or guarantees — either expressed or implied— as to the completeness, accuracy, or correctness of the data. DWR neither accepts nor assumes liability arising from or for any incorrect, incomplete, or misleading subject data. Comments, problems, improvements, updates, or suggestions should be forwarded to GIS@water.ca.gov.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

roblex nana (2020). The bAbI tasks data [Dataset]. https://www.kaggle.com/roblexnana/the-babi-tasks-for-nlp-qa-system

The bAbI tasks data

Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks

Explore at:

zip(17499991 bytes)Available download formats

Dataset updated

May 16, 2020

Authors

roblex nana

License

Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically

Description

Context

This dataset presents the first set of 20 tasks for testing text understanding and reasoning in the bAbI project. The tasks are described in detail in the paper: Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin and Tomas Mikolov. Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks, arXiv:1502.05698.

Please also see the following slides: Antoine Bordes Artificial Tasks for Artificial Intelligence, ICLR keynote, 2015.

The aim is that each task tests a unique aspect of text and reasoning, and hence test different capabilities of learning models. More tasks are planned in the future to capture more aspects.

Content

Training Set Size: For each task, there are 1000 questions for training, and 1000 for testing. However, we emphasize that the goal is to use as little data as possible to do well on the tasks (i.e. if you can use less than 1000 that’s even better) — and without resorting to engineering task-specific tricks that will not generalize to other tasks, as they may not be of much use subsequently. Note that the aim during evaluation is to use the same learner across all tasks to evaluate its skills and capabilities.

Supervision Signal: Further while the MemNN results in the paper use full supervision (including of the supporting facts) results with weak supervision would also be ultimately preferable as this kind of data is easier to collect. Hence results of that form are very welcome. E.g. this paper does include weakly supervised results.

For the reasons above there are currently several directories:

1) en/ — the tasks in English, readable by humans.
2) hn/ — the tasks in Hindi, readable by humans.
3) shuffled/ — the same tasks with shuffled letters so they are not readable by humans, and for existing parsers and taggers cannot be used in a straight-forward fashion to leverage extra resources– in this case the learner is more forced to rely on the given training data. This mimics a learner being first presented a language and having to learn from scratch.
4) en-10k/ shuffled-10k/ and hn-10k/ — the same tasks in the three formats, but with 10,000 training examples, rather than 1000 training examples. Note the results in the paper use 1000 training examples.

Versions: Some small updates since the original release have been made (see the README in the data download for more details). You can also get v1.0 and v1.1 here.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

The aim is to encourage the machine learning community to work on, and develop more, of these tasks.

References

https://research.fb.com/downloads/babi/

Clear search

Close search

Google apps

Main menu

The bAbI tasks data

Context

Content

Acknowledgements

Inspiration

References

Oceanographic XBT Data Device Location for Joint USGS Cruise 03008 and NOAA...

Cannabis - Consumer ,Producer dataset

Context

Content

Data from: Estimation of Unit Process Data for Life Cycle Assessment Using a...

High resolution spectral data predicts taxonomic diversity in low diversity...

Satellite (VIIRS) Thermal Hotspots and Fire Activity

Kokoro Speech Dataset v1.1 Tiny

Kokoro Speech Dataset

Sample data

File Format

Statistics

How to get the data

Pretrained Tacotron model

Books

Image

Scalable Unsupervised Learning for Unmanned Exploration

Intersects for Nantucket, Massachusetts, generated to calculate shoreline...

i07 PreSGMA GroundWaterManagementPlans

The bAbI tasks data

Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks

Context

Content

Acknowledgements

Inspiration

References