Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Combine, TX, as reported by the U.S. Census Bureau. It highlights how median household income varies with the size of the family unit, offering insight into economic trends and disparities across household sizes and aiding data analysis and decision-making.
Key observations
Chart: Combine, TX median household income, by household size (in 2022 inflation-adjusted dollars). Image: https://i.neilsberg.com/ch/combine-tx-median-household-income-by-household-size.jpeg
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Combine median household income. You can refer to it here.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Glycan arrays are indispensable for learning about the specificities of glycan-binding proteins. Despite the abundance of available data, the current analysis methods do not have the ability to interpret and use the variety of data types and to integrate information across datasets. Here, we evaluated whether a novel, automated algorithm for glycan-array analysis could meet that need. We developed a regression-tree algorithm with simultaneous motif optimization and packaged it in software called MotifFinder. We applied the software to analyze data from eight different glycan-array platforms with widely divergent characteristics and observed an accurate analysis of each dataset. We then evaluated the feasibility and value of the combined analyses of multiple datasets. In an integrated analysis of datasets covering multiple lectin concentrations, the software determined approximate binding constants for distinct motifs and identified major differences between the motifs that were not apparent from single-concentration analyses. Furthermore, an integrated analysis of data sources with complementary sets of glycans produced broader views of lectin specificity than produced by the analysis of just one data source. MotifFinder, therefore, enables the optimal use of the expanding resource of the glycan-array data and promises to advance the studies of protein–glycan interactions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Combine population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Combine. The dataset can be utilized to understand the population distribution of Combine by age. For example, using this dataset, we can identify the largest age group in Combine.
Key observations
The largest age group in Combine, TX was 5 to 9 years, with a population of 311 (11.19%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Combine, TX was 85 years and over, with a population of 6 (0.22%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Combine Population by Age. You can refer to it here.
A broad and generalized selection of 2014-2018 US Census Bureau 2018 5-year American Community Survey population data estimates, obtained via the Census API and joined to the appropriate geometry (in this case, New Mexico Census tracts). The selection is not comprehensive, but allows a first-level characterization of total population, male and female, and both broad and narrowly-defined age groups. In addition to the standard selection of age-group breakdowns (by male or female), the dataset provides supplemental calculated fields which combine several attributes into one (for example, the total population of persons under 18, or the number of females over 65 years of age). The determination of which estimates to include was based upon level of interest and providing a manageable dataset for users.

The U.S. Census Bureau's American Community Survey (ACS) is a nationwide, continuous survey designed to provide communities with reliable and timely demographic, housing, social, and economic data every year. The ACS collects long-form-type information throughout the decade rather than only once every 10 years. The ACS combines population or housing data from multiple years to produce reliable numbers for small counties, neighborhoods, and other local areas. To provide information for communities each year, the ACS provides 1-, 3-, and 5-year estimates. ACS 5-year estimates (multiyear estimates) are “period” estimates that represent data collected over a 60-month period of time (as opposed to “point-in-time” estimates, such as the decennial census, that approximate the characteristics of an area on a specific date). ACS data are released in the year immediately following the year in which they are collected. ACS estimates based on data collected from 2009–2014 should not be called “2009” or “2014” estimates. Multiyear estimates should be labeled to indicate clearly the full period of time. While the ACS contains margin of error (MOE) information, this dataset does not. Those individuals requiring more complete data are directed to download the more detailed datasets from the ACS American FactFinder website.

This dataset is organized by Census tract boundaries in New Mexico. Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity, and were defined by local participants as part of the 2010 Census Participant Statistical Areas Program. The primary purpose of census tracts is to provide a stable set of geographic units for the presentation of census data and comparison back to previous decennial censuses. Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people. State and county boundaries always are census tract boundaries in the standard census geographic hierarchy. In a few rare instances, a census tract may consist of noncontiguous areas. These noncontiguous areas may occur where the census tracts are coextensive with all or parts of legal entities that are themselves noncontiguous.
For the 2010 Census, the census tract code range of 9400 through 9499 was enforced for census tracts that include a majority American Indian population according to Census 2000 data and/or their area was primarily covered by federally recognized American Indian reservations and/or off-reservation trust lands; the code range 9800 through 9899 was enforced for those census tracts that contained little or no population and represented a relatively large special land use area such as a National Park, military installation, or a business/industrial park; and the code range 9900 through 9998 was enforced for those census tracts that contained only water area, no land area.
Licence to use Copernicus Products: https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. This catalogue entry provides post-processed ERA5 hourly single-level data aggregated to daily time steps. In addition to the data selection options found on the hourly page, the following options can be selected for the daily statistic calculation:
The daily aggregation statistic (daily mean, daily max, daily min, daily sum*)
The sub-daily frequency sampling of the original data (1 hour, 3 hours, 6 hours)
The option to shift to any local time zone in UTC (no shift means the statistic is computed from UTC+00:00)
*The daily sum is only available for the accumulated variables (see ERA5 documentation for more details). Users should be aware that the daily aggregation is calculated during the retrieval process and is not part of a permanently archived dataset. For more details on how the daily statistics are calculated, including demonstrative code, please see the documentation. For more details on the hourly data used to calculate the daily statistics, please refer to the ERA5 hourly single-level data catalogue entry and the documentation found therein.
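As a minimal sketch of retrieving these daily statistics through the CDS API, the request below is illustrative only: the dataset id, variable name, and request keys are assumptions based on the options described above, so check the documentation for the exact request schema.

```python
# Minimal sketch using the cdsapi client; the dataset id and request keys
# below are assumptions, not a verified request schema.
import cdsapi

client = cdsapi.Client()  # reads credentials from ~/.cdsapirc
client.retrieve(
    "derived-era5-single-levels-daily-statistics",  # assumed dataset id
    {
        "variable": "2m_temperature",     # assumed variable name
        "year": "2020",
        "month": "01",
        "day": ["01", "02", "03"],
        "daily_statistic": "daily_mean",  # daily mean/max/min/sum
        "frequency": "1_hourly",          # sub-daily sampling: 1, 3 or 6 hours
        "time_zone": "utc+00:00",         # optional local-time-zone shift
    },
    "era5_daily_mean_t2m.nc",
)
```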
This dataset is a compilation of address point data for the City of Tempe. The dataset contains a point location and the official address (as defined by the Building Safety Division of Community Development) for all occupiable units and any other official addresses in the City. There are several additional attributes that may be populated for an address, but they may not be populated for every address.
Contact: Lynn Flaaen-Hanna, Development Services Specialist
Contact E-mail Link
Map that Lets You Explore and Export Address Data
Data Source: The initial dataset was created by combining several datasets and then reviewing the information to remove duplicates and identify errors. This published dataset is the system of record for Tempe addresses going forward, with the address information being created and maintained by the Building Safety Division of Community Development.
Data Source Type: ESRI ArcGIS Enterprise Geodatabase
Preparation Method: N/A
Publish Frequency: Weekly
Publish Method: Automatic
Data Dictionary
A broad and generalized selection of 2013-2017 US Census Bureau 2017 5-year American Community Survey population data estimates, obtained via the Census API and joined to the appropriate geometry (in this case, New Mexico counties). The selection is not comprehensive, but allows a first-level characterization of total population, male and female, and both broad and narrowly-defined age groups. In addition to the standard selection of age-group breakdowns (by male or female), the dataset provides supplemental calculated fields which combine several attributes into one (for example, the total population of persons under 18, or the number of females over 65 years of age). The determination of which estimates to include was based upon level of interest and providing a manageable dataset for users.

The U.S. Census Bureau's American Community Survey (ACS) is a nationwide, continuous survey designed to provide communities with reliable and timely demographic, housing, social, and economic data every year. The ACS collects long-form-type information throughout the decade rather than only once every 10 years. As in the decennial census, strict confidentiality laws protect all information that could be used to identify individuals or households. The ACS combines population or housing data from multiple years to produce reliable numbers for small counties, neighborhoods, and other local areas. To provide information for communities each year, the ACS provides 1-, 3-, and 5-year estimates. ACS 5-year estimates (multiyear estimates) are “period” estimates that represent data collected over a 60-month period of time (as opposed to “point-in-time” estimates, such as the decennial census, that approximate the characteristics of an area on a specific date). ACS data are released in the year immediately following the year in which they are collected. ACS estimates based on data collected from 2009–2014 should not be called “2009” or “2014” estimates. Multiyear estimates should be labeled to indicate clearly the full period of time. The primary advantage of using multiyear estimates is the increased statistical reliability of the data for less populated areas and small population subgroups.

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. While each full Data Profile contains margin of error (MOE) information, this dataset does not. Those individuals requiring more complete data are directed to download the more detailed datasets from the ACS American FactFinder website. This dataset is organized by New Mexico county boundaries.
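Since the estimates were obtained via the Census API, a minimal sketch of the kind of request involved is shown below; the variable code and query are illustrative assumptions, not the exact queries used to build this dataset.

```python
# Illustrative ACS 5-year API request: total population for every New Mexico
# county (FIPS 35). B01001_001E is the total-population estimate variable.
import requests

url = "https://api.census.gov/data/2017/acs/acs5"
params = {
    "get": "NAME,B01001_001E",  # county name and total population
    "for": "county:*",          # all counties...
    "in": "state:35",           # ...in New Mexico
}
rows = requests.get(url, params=params, timeout=30).json()
header, data = rows[0], rows[1:]
print(header)   # ['NAME', 'B01001_001E', 'state', 'county']
print(data[0])  # one row per county
```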
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.
The dataset consists of trajectory data from three Hamiltonian systems: the single pendulum, the double pendulum, and the 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data for the double pendulum was also computed with the symplectic Euler method, but with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.
For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames: one DataFrame per trajectory, called "run0", "run1", ..., and one large DataFrame combining all trajectories, called "all_runs". Additionally, each file contains a Pandas Series called "constants", in which several parameters of the data are listed.
Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
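A minimal sketch for loading these files with pandas follows; the file names are placeholders following the patterns above, so substitute the actual *_all_runs.h5.1 / *_training.h5.1 paths.

```python
# Minimal loading sketch, assuming pandas HDF5 (PyTables) storage and
# placeholder file names; the keys follow the description above.
import pandas as pd

# Trajectory file: one DataFrame per run plus the combined "all_runs".
all_runs = pd.read_hdf("single_pendulum_all_runs.h5.1", key="all_runs")
run0 = pd.read_hdf("single_pendulum_all_runs.h5.1", key="run0")
constants = pd.read_hdf("single_pendulum_all_runs.h5.1", key="constants")

# Training file: features/labels pre-split 80/10/10.
features = pd.read_hdf("single_pendulum_training.h5.1", key="features")
labels = pd.read_hdf("single_pendulum_training.h5.1", key="labels")
print(all_runs.shape, features.shape, dict(constants))
```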
The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.
Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.
| | Single pendulum | Double pendulum | 3-body problem |
|---|---|---|---|
| Number of trajectories | 500 | 2000 | 5000 |
| Final time in all_runs | T (one period of the pendulum) | 10 | 10 |
| Final time in training data | 0.25*T | 5 | 5 |
| Step size in training data | 0.1 | 0.1 | 0.5 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BirdVox-scaper-10k: a synthetic dataset for multilabel species classification of flight calls from 10-second audio recordings
=============================================================================================
Version 1.0, September 2019.
Created By
-------------
Elizabeth Mendoza (1), Vincent Lostanlen (2, 3, 4), Justin Salamon (3, 4), Andrew Farnsworth (2), Steve Kelling (2), and Juan Pablo Bello (3, 4).
(1): Forest Hills High School, New York, NY, USA
(2): Cornell Lab of Ornithology, Cornell University, Ithaca, NY, USA
(3): Center for Urban Science and Progress, New York University, New York, NY, USA
(4): Music and Audio Research Lab, New York University, New York, NY, USA
Description
--------------
The BirdVox-scaper-10k dataset contains 9983 artificial soundscapes. Each soundscape lasts exactly ten seconds and contains one or several avian flight calls from up to 30 different species of New World warblers (Parulidae). Alongside each audio file, we include an annotation file describing the start time and end time of each flight call in the corresponding soundscape, as well as the species of warbler it belongs to.
In order to synthesize soundscapes in BirdVox-scaper-10k, we mixed natural sounds from various pre-recorded sources. First, we extracted isolated recordings of flight calls containing little or no background noise from the CLO-43SD dataset [1]. Secondly, we extracted 10-second "empty" acoustic scenes from the BirdVox-DCASE-20k dataset [2]. These acoustic scenes contain various sources of real-world background noise, including biophony (insects) and anthropophony (vehicles), yet are guaranteed to be devoid of any flight calls. Lastly, we "fill" each acoustic scene by mixing it with flight calls sampled at random.
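The description above does not name the synthesis tool, but the dataset's name points to the Scaper soundscape-synthesis library; the sketch below is a hypothetical illustration of how such a 10-second soundscape could be assembled, with all paths and parameter distributions assumed rather than taken from the dataset's actual generation settings.

```python
# Hypothetical synthesis sketch with the Scaper library; the foreground /
# background paths and the parameter distributions are assumptions.
import scaper

sc = scaper.Scaper(
    duration=10.0,
    fg_path="CLO-43SD_flight_calls/",    # assumed: isolated flight calls
    bg_path="BirdVox-DCASE-20k_empty/",  # assumed: call-free acoustic scenes
)
sc.ref_db = -50

# One call-free background scene drawn at random.
sc.add_background(label=("choose", []),
                  source_file=("choose", []),
                  source_time=("const", 0))

# One flight call sampled at random, roughly 150 ms long.
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0.0, 9.8),
             event_duration=("const", 0.15),
             snr=("uniform", 6, 30),
             pitch_shift=None,
             time_stretch=None)

# Writes the mixed audio plus a JAMS annotation of event times and labels.
sc.generate("soundscape_0001.wav", "soundscape_0001.jams")
```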
Although the BirdVox-scaper-10k does not consist of natural recordings, we have taken several measures to ensure the plausibility of each synthesized soundscape, from both qualitative and quantitative standpoints.
The BirdVox-scaper-10k dataset can be used, among other things, for the research, development, and testing of bioacoustic classification models.
For details on the hardware of ROBIN recording units, we refer the reader to [2].
[1] J. Salamon, J. Bello. Fusing shallow and deep learning for bioacoustic bird species classification. Proc. IEEE ICASSP, 2017.
[2] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. Bello. BirdVox-full-night: a dataset and benchmark for avian flight call detection. Proc. IEEE ICASSP, 2018.
[3] J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling. Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring. PLoS One, 2016.
@inproceedings{lostanlen2018icassp,
title = {BirdVox-full-night: a dataset and benchmark for avian flight call detection},
author = {Lostanlen, Vincent and Salamon, Justin and Farnsworth, Andrew and Kelling, Steve and Bello, Juan Pablo},
booktitle = {Proc. IEEE ICASSP},
year = {2018},
publisher = {IEEE},
address = {Calgary, Canada},
month = {April},
}
Coastal habitats are utilized and altered for a suite of human uses. Habitat modification is here defined as the alteration or removal of geomorphic structure as a result of human use. This includes several habitat-modifying features like seawalls, piers, breakwaters, dredged areas, artificial land (i.e., filled wetlands), and offshore structures. This data layer represents the presence of habitat modification in shallow waters of the Main Hawaiian Islands. The Ocean Tipping Points (OTP) project mapped the presence of habitat-modifying features by combining several existing datasets derived primarily from satellite and aerial imagery, including the following datasets: benthic habitat maps (NOAA Center for Coastal Monitoring and Assessment (CCMA), 2007); NOAA Environmental Sensitivity Index (ESI) line data (NOAA Office of Response and Restoration (OR&R), 2001); maintained channels (NOAA, US Army Corps of Engineers (USACE), MarineCadastre.gov); and locations of offshore aquaculture. The layer represents the presence or absence of habitat modification, with a cell size of 500 m. Relevant man-made features were extracted from each individual dataset and saved (features classified as artificial and dredged areas in NOAA benthic habitat maps; coastal segments designated as man-made structures and riprap in NOAA ESI line data; all features from the maintained channels and aquaculture datasets). The resulting polygon datasets were merged together. A field was added to all vector layers with a value of 1 for each feature to represent the presence of habitat modification. Vector data were then converted to 500-m rasters and combined into a mosaic.
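As a rough illustration of the merge-and-rasterize workflow described above, the sketch below uses geopandas and rasterio; the input file names, CRS, and grid extent are placeholder assumptions, not the OTP project's actual inputs.

```python
# Illustrative vector-merge and 500 m rasterization sketch; paths, CRS,
# and grid extent are placeholder assumptions.
import geopandas as gpd
import pandas as pd
import rasterio
from rasterio import features
from rasterio.transform import from_origin

# Man-made features already extracted from each source dataset.
layers = [gpd.read_file(p) for p in
          ["benthic_artificial.shp", "esi_manmade.shp",
           "maintained_channels.shp", "aquaculture.shp"]]
merged = gpd.GeoDataFrame(pd.concat(layers, ignore_index=True),
                          crs=layers[0].crs)
merged["presence"] = 1  # 1 = habitat modification present

# 500 m grid over an assumed projected extent.
cell = 500.0
width, height = 1000, 1000
transform = from_origin(0.0, 500_000.0, cell, cell)

raster = features.rasterize(
    ((geom, 1) for geom in merged.geometry),
    out_shape=(height, width), transform=transform, fill=0, dtype="uint8")

with rasterio.open("habitat_modification.tif", "w", driver="GTiff",
                   height=height, width=width, count=1, dtype="uint8",
                   crs=merged.crs, transform=transform) as dst:
    dst.write(raster, 1)
```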
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Combined Locks by race. It includes the population of Combined Locks across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of Combined Locks across relevant racial categories.
Key observations
The percent distribution of Combined Locks population by race (across all racial categories recognized by the U.S. Census Bureau): 87.67% are white, 4.51% are Black or African American, 3.85% are some other race and 3.96% are multiracial.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Racial categories include:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Combined Locks Population by Race & Ethnicity. You can refer to it here.
This dataset is a modified version of the FWS-developed data depicting “Highly Important Landscapes”, as outlined in Memorandum FWS/AES/058711 and provided to the Wildlife Habitat Spatial Analysis Lab on October 29th, 2014. Other names and acronyms used to refer to this dataset have included: Areas of Significance (AoSs, the name of the GIS data set provided by FWS), Strongholds (FWS), and Sagebrush Focal Areas (SFAs, BLM). The BLM will refer to these data as Sagebrush Focal Areas (SFAs).

Data were provided as a series of ArcGIS map packages which, when extracted, contained several datasets each. Based on the recommendation of the FWS Geographer/Ecologist (email communication, see data originator for contact information), the dataset called “Outiline_AreasofSignificance” was utilized as the source for subsequent analysis and refinement. Metadata was not provided by the FWS for this dataset. For detailed information regarding the dataset’s creation, refer to Memorandum FWS/AES/058711 or contact the FWS directly.

Several operations and modifications were made to this source data, as outlined in the “Description” and “Process Step” sections of this metadata file. Generally: the source data was named by the Wildlife Habitat Spatial Analysis Lab to identify polygons as described (but not identified in the GIS) in the FWS memorandum. The Nevada/California EIS modified portions within their decision space in concert with local FWS personnel and provided the modified data back to the Wildlife Habitat Spatial Analysis Lab. Gaps around Nevada State borders, introduced by the NVCA edits, were then closed, as was a large gap between the southern Idaho and southeast Oregon areas present in the original dataset. Features with an area below 40 acres were then identified and, based on FWS guidance, either removed or retained. Finally, guidance from BLM WO resulted in the removal of additional areas, primarily non-habitat with BLM surface or subsurface management authority. Data were then provided to each EIS for use in FEIS development. Based on guidance from WO, SFAs were to be limited to BLM decision space (surface/sub-surface management areas) within PHMA. Each EIS was asked to provide the limited SFA dataset back to the National Operations Center to ensure consistent representation and analysis. Returned SFA data, modified by each individual EIS, was then consolidated at the BLM’s National Operations Center, retaining the three standardized fields contained in this dataset.

Several modifications from the original FWS dataset have been made. Below is a summary of each modification.

1. The data as received from FWS: 16,514,163 acres & 1 record.

2. Edited to name SFAs by the Wildlife Habitat Spatial Analysis Lab: upon receipt of the “Outiline_AreasofSignificance” dataset from the FWS, a copy was made and the one existing, unnamed record was exploded in an edit session within ArcMap. A text field, “AoS_Name”, was added. Using the maps provided with Memorandum FWS/AES/058711, polygons were manually selected and the “AoS_Name” field was calculated to match the names as illustrated. Once all polygons in the exploded dataset were appropriately named, the dataset was dissolved, resulting in one record representing each of the seven SFAs identified in the memorandum.

3. The NVCA EIS made modifications in concert with local FWS staff. Metadata and detailed change descriptions were not returned with the modified data. Contact Leisa Wesch, GIS Specialist, BLM Nevada State Office, 775-861-6421, lwesch@blm.gov, for details.

4. Once the data was returned to the Wildlife Habitat Spatial Analysis Lab from the NVCA EIS, gaps surrounding the State of NV were closed. These gaps were introduced by the NVCA edits, exacerbated by them, or existed in the data as provided by the FWS. The gap closing was performed in an edit session by either extending each polygon towards the other or by creating a new polygon covering the gap and merging it with the existing features. In addition to the gaps around state boundaries, a large area between the S. Idaho and S.E. Oregon SFAs was filled in. To accomplish this, ADPP habitat (current as of January 2015) and BLM GSSP SMA data were used to create a new polygon representing PHMA and BLM management that connected the two existing SFAs.

5. In an effort to simplify the FWS dataset, features whose areas were less than 40 acres were identified and FWS was consulted for guidance on possible removal. To do so, features from step 4 above were exploded once again in an ArcMap edit session. Features whose areas were less than forty acres were selected and exported (770 total features). This dataset was provided to the FWS and then returned with specific guidance on inclusion/exclusion via email by Lara Juliusson (lara_juliusson@fws.gov). The specific guidance was:
a. Remove all features whose area is less than 10 acres.
b. Remove features identified as slivers (the thinness ratio was calculated and slivers identified by Lara Juliusson according to https://tereshenkov.wordpress.com/2014/04/08/fighting-sliver-polygons-in-arcgis-thinness-ratio/) and whose area was less than 20 acres.
c. Remove features with areas less than 20 acres NOT identified as slivers and NOT adjacent to other features.
d. Keep the remainder of features identified as less than 40 acres.
To accomplish “a” and “b” above, a simple selection was applied to the dataset representing features less than 40 acres. The select by location tool was used, set to select identical, to select these features from the dataset created in step 4 above. The records count was confirmed as matching between the two data sets and then these features were deleted. To accomplish “c” above, a field (“AdjacentSH”, added by FWS but not calculated) was calculated to identify features touching or intersecting other features. A series of selections was used: first to select records

6. Based on direction from the BLM Washington Office, the portion of the Upper Missouri River Breaks National Monument (UMRBNM) that was included in the FWS SFA dataset was removed. The BLM NOC GSSP NLCS dataset was used to erase these areas from the result of step 5 above. Resulting sliver polygons were also removed and geometry was repaired.

7. In addition to removing UMRBNM, the BLM Washington Office also directed the removal of non-ADPP habitat within the SFAs, on BLM-managed lands, falling outside of Designated Wilderness & Wilderness Study Areas. An exception was the retention of the Donkey Hills ACEC and adjacent BLM lands. The BLM NOC GSSP NLCS datasets were used in conjunction with a dataset containing all ADPP habitat, BLM SMA, and BLM sub-surface management unioned into one file to identify and delete these areas.

8. The resulting dataset, after steps 2 – 8 above were completed, was dissolved to the SFA name field, yielding this feature class with one record per SFA area.

9. Data were provided to each EIS for use in FEIS allocation decision data development.

10. Data were subset to BLM decision space (surface/sub-surface) within PHMA by each EIS and returned to the NOC.

11. Due to variations in field names and values, three standardized fields were created and calculated by the NOC:
a. SFA Name: the name of the SFA.
b. Subsurface: binary “Yes” or “No” to indicate federal subsurface estate.
c. SMA: represents BLM, USFS, other federal, and non-federal surface management.

12. The consolidated data (with standardized field names and values) were dissolved on the three fields illustrated above and geometry was repaired, resulting in this dataset.
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.
####
Physical parameters raw log files
Raw log files:
1) DATE
2) Time (UTC+11)
3) PROG = automated program to control sensors and collect data
4) BAT = amount of battery remaining
5) STEP = check Aquation manual
6) SPIES = check Aquation manual
7) PAR = photoactive radiation
8) Levels = check Aquation manual
9) Pumps = program for pumps
10) WQM = check Aquation manual
####
Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets. Note: two data sets are provided in different formats, raw and cleaned (adj). These are the same data, with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - check
Time: UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers):
ID43 = pulse amplitude fluorescence measurement of control
ID44 = pulse amplitude fluorescence measurement of acidified chamber
ID=1 dissolved oxygen
ID=2 dissolved oxygen
ID3 = PAR
ID4 = PAR
PAR = photoactive radiation (umols)
F0 = minimal fluorescence from PAM
Fm = maximum fluorescence from PAM
Yield = (F0 – Fm)/Fm
rChl = an estimate of chlorophyll (note this is uncalibrated and is an estimate only)
Temp = temperature, degrees C
PAR = photoactive radiation
PAR2 = photoactive radiation 2
DO = dissolved oxygen
%Sat = saturation of dissolved oxygen
Notes = the program of the underwater submersible logger, with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM = gain level set (see Aquation manual for more detail)
Notes-3) Acclimatisation = program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample… = Shutter PAM's automatic set-up procedure (see Aquation manual)
Notes-5) Yield step 2 = PAM yield measurement and calculation of control
Notes-6) Yield step 5 = PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1 = program to measure dissolved oxygen and PAR (see Aquation manual); steps 1-4 are different stages of this program, including pump cycles and DO and PAR measurements.
8) Rapid light curve data:
Pre LC: a yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): an extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: an extra measure of PAR (umols) using a Dataflow sensor
PAM PAR: copied from the PAR or PAR2 column
PAR all: the complete PAR file, which should be used
Deployment: identifies which deployment the data came from
####
Respiration chamber biomass data
The data is chlorophyll a biomass from cores taken from the respiration chambers. The headers are: Depth (mm); Treat (acidified or control); Chl a (pigment and indicator of biomass); Core (5 cores were collected from each chamber, three of which were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
####
Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to the chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: users need to avoid the times when the respiration chambers are delivering water, as this will give incorrect results. The headers used in the attached/associated R file are "start regression" and "end regression". The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180-minute data blocks were determined. R-squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions in the calculation of net production rates: first, that heterotrophic community members do not change their metabolism under OA; and second, that the heterotrophic communities are similar between treatments.
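A minimal sketch of this block-wise regression is shown below; the file and column names are assumptions (the study's actual analysis lives in the associated R script), and only the DATETIME and DO headers are taken from the data description above.

```python
# Illustrative block-wise regression sketch; the input file name is assumed,
# while DATETIME and DO follow the header descriptions above.
import numpy as np
import pandas as pd

df = pd.read_csv("chamber_DO.csv", parse_dates=["DATETIME"])  # assumed file
df["elapsed_min"] = (df["DATETIME"] - df["DATETIME"].iloc[0]).dt.total_seconds() / 60

# Assign each sample to a discrete 180-minute block and fit DO ~ time per block.
df["block"] = (df["elapsed_min"] // 180).astype(int)
rates = []
for block, g in df.groupby("block"):
    slope, intercept = np.polyfit(g["elapsed_min"], g["DO"], deg=1)
    # R^2 of the linear fit, to check the >0.9 criterion noted above.
    pred = slope * g["elapsed_min"] + intercept
    ss_res = ((g["DO"] - pred) ** 2).sum()
    ss_tot = ((g["DO"] - g["DO"].mean()) ** 2).sum()
    rates.append({"block": block, "slope": slope,
                  "intercept": intercept, "r2": 1 - ss_res / ss_tot})
print(pd.DataFrame(rates))
```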
####
Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are:
PAR: photoactive radiation
relETR: F0/Fm x PAR
Notes: stage/step of light curve
Treatment: acidified or control
The associated light treatments in each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% = 15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
Note: this dataset appears to be missing data; the D5 rows potentially do not contain usable information.
See the word document in the download file for more information.
This task focuses on sound event detection in a few-shot learning setting for animal (mammal and bird) vocalisations. Participants will be expected to create a method that can extract information from five exemplar vocalisations (shots) of mammals or birds and detect and classify sounds in field recordings.
For more info please refer to the official website: https://dcase.community/challenge2023/task-few-shot-bioacoustic-event-detection
Few-shot learning is a highly promising paradigm for sound event detection. It is also an extremely good fit to the needs of users in bioacoustics, in which increasingly large acoustic datasets commonly need to be labelled for events of an identified category (e.g. species or call-type), even though this category might not be known in other datasets or have any yet-known label. While satisfying user needs, this will also benchmark few-shot learning for the wider domain of sound event detection (SED).
Few-shot learning describes tasks in which an algorithm must make predictions given only a few instances of each class, contrary to the standard supervised learning paradigm. The main objective is to find reliable algorithms that are capable of dealing with data sparsity, class imbalance, and noisy/busy environments. Few-shot learning is usually studied using N-way-K-shot classification, where N denotes the number of classes and K the number of examples for each class.
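As a generic illustration of N-way-K-shot episode construction (not the task's evaluation protocol, which instead provides five shots per validation file), a minimal sketch:

```python
# Generic N-way-K-shot episode sampler; illustrative only.
import random
from collections import defaultdict

def sample_episode(labelled_events, n_way=5, k_shot=5, n_query=10):
    """labelled_events: list of (event, class_label) pairs."""
    by_class = defaultdict(list)
    for event, label in labelled_events:
        by_class[label].append(event)
    # Pick N classes that have enough examples for support + query.
    classes = random.sample([c for c, ev in by_class.items()
                             if len(ev) >= k_shot + n_query], n_way)
    support, query = [], []
    for c in classes:
        events = random.sample(by_class[c], k_shot + n_query)
        support += [(e, c) for e in events[:k_shot]]   # K shots per class
        query += [(e, c) for e in events[k_shot:]]     # held-out queries
    return support, query  # adapt on support, evaluate on query
```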
Some reasons why few-shot learning has been of increasing interest:
Scarcity of supervised data can lead to unreliable generalisations of machine learning models.
Explicitly labeling a huge dataset can be costly both in time and resources.
Fixed ontologies or class labels used in SED and other DCASE tasks are often a poor fit to a given user's goal.

Development Set

The development set is pre-split into training and validation sets. The training set consists of five sub-folders, each deriving from a different source. Along with the audio files, multi-class annotations are provided for each. The validation set consists of two sub-folders, each deriving from a different source, with a single-class (class of interest) annotation file provided for each audio file.
The training set contains five different sub-folders (BV, HV, JD, MT, WMW). Statistics are given overall and for each sub-folder.
| Overall Statistics | Values |
|---|---|
| Number of audio recordings | 174 |
| Total duration | 21 hours |
| Total classes (excl. UNK) | 47 |
| Total events (excl. UNK) | 14229 |
The BirdVox-DCASE-10h (BV for short) contains five audio files from four different autonomous recording units, each lasting two hours. These autonomous recording units are all located in Tompkins County, New York, United States. Furthermore, they follow the same hardware specification: the Recording and Observing Bird Identification Node (ROBIN) developed by the Cornell Lab of Ornithology. Andrew Farnsworth, an expert ornithologist, has annotated these recordings for the presence of flight calls from migratory passerines, namely: American sparrows, cardinals, thrushes, and warblers. In total, the annotator found 2,662 flight calls from 11 different species. We estimate these flight calls to have a duration of 150 milliseconds and a fundamental frequency between 2 kHz and 10 kHz.
| Statistics | Values |
|---|---|
| Number of audio recordings | 5 |
| Total duration | 10 hours |
| Total classes (excl. UNK) | 11 |
| Total events (excl. UNK) | 9026 |
| Ratio event/duration | 0.04 |
| Sampling rate | 24,000 Hz |
Spotted hyenas are a highly social species that live in "fission-fusion" groups where group members range alone or in smaller subgroups that split and merge over time. Hyenas use a variety of types of vocalizations to coordinate with one another over both short and long distances. Spotted hyena vocalization data were recorded on custom-developed audio tags designed by Mark Johnson and integrated into combined GPS / acoustic collars (Followit Sweden AB) by Frants Jensen and Mark Johnson. Collars were deployed on female hyenas of the Talek West hyena clan at the MSU-Mara Hyena Project (directed by Kay Holekamp) in the Masai Mara, Kenya as part of a multi-species study on communication and collective behavior. Field work was carried out by Kay Holekamp, Andrew Gersick, Frants Jensen, Ariana Strandburg-Peshkin, and Benson Pion; labeling was done by Kenna Lehmann and colleagues.
| Statistics | Values |
|---|---|
| Number of audio recordings | 5 |
| Total duration | 5 hours |
| Total classes (excl. UNK) | 3 |
| Total events (excl. UNK) | 611 |
| Ratio events/duration | 0.05 |
| Sampling rate | 6000 Hz |
Jackdaws are corvid songbirds which usually breed, forage and sleep in large groups, but form a pair bond with the same partner for life. They produce thousands of vocalisations per day, but many aspects of their vocal behaviour remained unexplored due to the difficulty in recording and assigning vocalisations to specific individuals, especia...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
| Label | Data type | Description |
|---|---|---|
| isogramy | int | The order of isogramy, e.g. "2" is a second-order isogram |
| length | int | The length of the word in letters |
| word | text | The actual word/isogram in ASCII |
| source_pos | text | The Part of Speech tag from the original corpus |
| count | int | Token count (total number of occurrences) |
| vol_count | int | Volume count (number of different sources which contain the word) |
| count_per_million | int | Token count per million words |
| vol_count_as_percent | int | Volume count as percentage of the total number of volumes |
| is_palindrome | bool | Whether the word is a palindrome (1) or not (0) |
| is_tautonym | bool | Whether the word is a tautonym (1) or not (0) |
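A minimal sketch for loading one of the ".csv" files with pandas, using the tab separator, missing header row, and column order documented above (the file name matches the naming convention described in section 2.4):

```python
# Load a tab-separated, header-less isogram CSV with the documented columns.
import pandas as pd

columns = ["isogramy", "length", "word", "source_pos", "count",
           "vol_count", "count_per_million", "vol_count_as_percent",
           "is_palindrome", "is_tautonym"]
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)

# Example: the most frequent palindromic isograms.
print(df[df["is_palindrome"] == 1].sort_values("count", ascending=False).head())
```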
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
| Label | Data type | Description |
|---|---|---|
| !total_1grams | int | The total number of words in the corpus |
| !total_volumes | int | The total number of volumes (individual sources) in the corpus |
| !total_isograms | int | The total number of isograms found in the corpus (before compacting) |
| !total_palindromes | int | How many of the isograms found are palindromes |
| !total_tautonyms | int | How many of the isograms found are tautonyms |
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the “create-database.sql” script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called “isograms.db”.
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
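The database can also be queried directly from Python with the standard library; a minimal sketch follows, in which the table name is an assumption (inspect create-database.sql or run `.tables` in the sqlite3 shell for the actual names), while the column names follow the schema documented in section 1.1.

```python
# Query the isograms database; the table name below is an assumption,
# so check create-database.sql for the actual table names.
import sqlite3

con = sqlite3.connect("isograms.db")
rows = con.execute(
    """SELECT word, length, count
       FROM ngrams_isograms          -- assumed table name
       WHERE is_palindrome = 1 AND length >= 5
       ORDER BY count DESC LIMIT 10"""
).fetchall()
for word, length, count in rows:
    print(f"{word:<15} len={length:<3} count={count}")
con.close()
```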
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EyeFi Dataset
This dataset is collected as a part of the EyeFi project at the Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground truth location information captured through a camera. This dataset is used in the paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching", published in the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in the Data: Acquisition to Analysis 2020 (DATA '20) workshop, describing the details of data collection. Please check it out for more information on the dataset.
Data Collection Setup
In our experiments, we used an Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC, together with the Linux CSI tools [1], to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from a Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling, and Angles of Arrival (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and camera are located at the same origin coordinates but at different heights: the camera is located around 2.85m from the ground and the WiFi antennas are around 1.12m above the ground.
The data collection environment consists of two areas: the first is a rectangular space measuring 11.8m x 8.74m, and the second is an irregularly shaped kitchen area with maximum distances of 19.74m and 14.24m between two walls. The kitchen also has numerous obstacles and different materials that pose different RF reflection characteristics, including strong reflectors such as metal refrigerators and dishwashers.
To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connected the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone are used in both the lab and kitchen areas.
List of Files
Here is a list of files included in the dataset:
|- 1_person
   |- 1_person_1.h5
   |- 1_person_2.h5
|- 2_people
   |- 2_people_1.h5
   |- 2_people_2.h5
   |- 2_people_3.h5
|- 3_people
   |- 3_people_1.h5
   |- 3_people_2.h5
   |- 3_people_3.h5
|- 5_people
   |- 5_people_1.h5
   |- 5_people_2.h5
   |- 5_people_3.h5
   |- 5_people_4.h5
|- 10_people
   |- 10_people_1.h5
   |- 10_people_2.h5
   |- 10_people_3.h5
|- Kitchen
   |- 1_person
      |- kitchen_1_person_1.h5
      |- kitchen_1_person_2.h5
      |- kitchen_1_person_3.h5
   |- 3_people
      |- kitchen_3_people_1.h5
|- training
   |- shuffuled_train.h5
   |- shuffuled_valid.h5
   |- shuffuled_test.h5
|- View-Dataset-Example.ipynb
|- README.md
In this dataset, the folders `1_person/`, `2_people/`, `3_people/`, `5_people/`, and `10_people/` contain data collected from the lab area, whereas the `Kitchen/` folder contains data collected from the kitchen area. To see how each file is structured, please see the section "Access the data" below.
The training folder contains the training dataset we used to train the neural network discussed in our paper. It is generated by shuffling all the data from the `1_person/` folder collected in the lab area (`1_person_1.h5` and `1_person_2.h5`).
Why multiple files in one folder?
Each folder contains multiple files. For example, the `1_person` folder has two files: `1_person_1.h5` and `1_person_2.h5`. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person who is holding the phone can be different. Also, the data could be collected on different days, and/or the data collection system may need to be rebooted due to stability issues. As a result, we provide different files (like `1_person_1.h5`, `1_person_2.h5`) to distinguish between different people holding the phone and possible system reboots that introduce different phase offsets (see below) in the system.
Special note:
`1_person_1.h5` is generated by the same person holding the phone throughout, while `1_person_2.h5` contains different people holding the phone, with only one person present in the area at a time. Both files were also collected on different days.
Access the data
To access the data, the hdf5 library is needed to open the dataset. There is a free HDF5 viewer available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide example Python code, View-Dataset-Example.ipynb, to demonstrate how to access the data.
Each file is structured as follows (except the files under the *"training/"* folder):
|- csi_imag
|- csi_real
|- nPaths_1
   |- offset_00
      |- spotfi_aoa
   |- offset_11
      |- spotfi_aoa
   |- offset_12
      |- spotfi_aoa
   |- offset_21
      |- spotfi_aoa
   |- offset_22
      |- spotfi_aoa
|- nPaths_2
   |- offset_00
      |- spotfi_aoa
   |- offset_11
      |- spotfi_aoa
   |- offset_12
      |- spotfi_aoa
   |- offset_21
      |- spotfi_aoa
   |- offset_22
      |- spotfi_aoa
|- nPaths_3
   |- offset_00
      |- spotfi_aoa
   |- offset_11
      |- spotfi_aoa
   |- offset_12
      |- spotfi_aoa
   |- offset_21
      |- spotfi_aoa
   |- offset_22
      |- spotfi_aoa
|- nPaths_4
   |- offset_00
      |- spotfi_aoa
   |- offset_11
      |- spotfi_aoa
   |- offset_12
      |- spotfi_aoa
   |- offset_21
      |- spotfi_aoa
   |- offset_22
      |- spotfi_aoa
|- num_obj
|- obj_0
   |- cam_aoa
   |- coordinates
|- obj_1
   |- cam_aoa
   |- coordinates
...
|- timestamp
The `csi_real` and `csi_imag` are the real and imaginary parts of the CSI measurements. The order of antennas and subcarriers for the 90 `csi_real` and `csi_imag` values is as follows: [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3, … subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. The `nPaths_x` groups are SpotFi [2]-calculated WiFi Angle of Arrival (AoA) values, with `x` the number of multiple paths specified during the calculation. Under each `nPaths_x` group are `offset_xx` subgroups, where `xx` stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:
| Antennas | Offset 1 (rad) | Offset 2 (rad) |
|:--------:|:--------------:|:--------------:|
| 1 & 2 | 1.1899 | -2.0071 |
| 1 & 3 | 1.3883 | -1.8129 |
The measurement is based on the work in [3], where the authors state that there are two possible offsets between two antennas, which we measured by booting the device multiple times. The combinations of the offsets are used for the `offset_xx` naming. For example, `offset_12` means that offset 1 between antennas 1 & 2 and offset 2 between antennas 1 & 3 were used in the SpotFi calculation.
The `num_obj` field stores the number of human subjects present in the scene. `obj_0` is always the subject holding the phone. In each file, there are `num_obj` `obj_x` groups. For each `obj_x`, we have the `coordinates` reported from the camera and `cam_aoa`, the AoA estimated from the camera-reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except in the files in the `training` folder). They reflect the way the person carrying the phone moved in the space (for `obj_0`) and the way everyone else walked (for the other `obj_y`, where `y` > 0).
The `timestamp` is provided as a time reference for each WiFi packet.
To access the data (Python):
import h5py

# Open one of the lab-area capture files in read-only mode.
data = h5py.File('3_people_3.h5', 'r')
csi_real = data['csi_real'][()]          # real part of the CSI values
csi_imag = data['csi_imag'][()]          # imaginary part of the CSI values
cam_aoa = data['obj_0/cam_aoa'][()]      # camera-estimated AoA of the phone holder
cam_loc = data['obj_0/coordinates'][()]  # camera-reported (x,y) coordinates
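Building on this, the flattened 90-value CSI vectors can be combined into complex values and reshaped by subcarrier and antenna, following the ordering described above; this is a sketch, not part of the released notebook.

```python
# Combine real/imaginary parts and reshape to (packets, subcarriers, antennas),
# using the documented subcarrier-major, antenna-minor ordering.
import numpy as np

csi = csi_real + 1j * csi_imag  # shape: (num_packets, 90)
csi = csi.reshape(-1, 30, 3)    # 30 subcarriers x 3 antennas
amplitude = np.abs(csi)
phase = np.angle(csi)
print(csi.shape, amplitude.mean())
```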
Files inside the `training/` folder have a different data structure:
|- nPath-1
   |- aoa
   |- csi_imag
   |- csi_real
   |- spotfi
|- nPath-2
   |- aoa
   |- csi_imag
   |- csi_real
   |- spotfi
|- nPath-3
   |- aoa
   |- csi_imag
   |- csi_real
   |- spotfi
|- nPath-4
   |- aoa
   |- csi_imag
   |- csi_real
   |- spotfi
The group `nPath-x` gives the number of multiple paths specified during the SpotFi calculation. `aoa` is the camera-generated angle of arrival (AoA) (which can be considered ground truth), `csi_imag` and `csi_real` are the imaginary and real components of the CSI value, and `spotfi` is the SpotFi-calculated AoA values. The SpotFi values are chosen based on the lowest median and mean error across `1_person_1.h5` and `1_person_2.h5`. All the rows under the same `nPath-x` group are aligned (i.e., the first row of `aoa` corresponds to the first row of `csi_imag`, `csi_real`, and `spotfi`). There is no timestamp recorded, and the sequence of the data is not chronological, as the rows are randomly shuffled from the `1_person_1.h5` and `1_person_2.h5` files.
Citation
If you use the dataset, please cite our paper:
@inproceedings{eyefi2020,
title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching},
author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar},
booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},
year={2020},
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Buzsaki Lab is proud to present a large selection of experimental data available for public access: https://buzsakilab.com/wp/database/. We publicly share more than a thousand sessions (about 40 TB of raw, spike-processed, and LFP-processed data) via our public data repository. The datasets are from freely moving rodents and include sleep-task-sleep sessions (3 to 24 hours of continuous recording) in various brain structures, including metadata. We are happy to assist you in using the data. Our goal is that by sharing these data, other users can provide new insights, extend, contradict, or clarify our conclusions.
The databank contains electrophysiological recordings performed in freely moving rats and mice (with a subset from head-fixed mice), collected by investigators in the Buzsaki Lab over several years. Sessions were collected with extracellular electrodes using high-channel-count silicon probes, with spike-sorted single units, and with intracellular and juxtacellular electrodes combined with extracellular electrodes. Several sessions include physiologically and optogenetically identified units. The sessions were collected from various brain region pairs: the hippocampus, thalamus, amygdala, post-subiculum, septal region, entorhinal cortex, and various neocortical regions. In most behavioral tasks, the animals performed spatial behaviors (linear mazes and open fields), preceded and followed by long sleep sessions. Brain state classification is provided.
Getting started
The top menu “Databank” serves as a navigational menu to the databank. The metadata describing the experiments is stored in a relational database, which means that there are many entry points for exploring the data. The databank is organized by projects, animal subjects, and sessions.
Accessing and downloading the datasets
We share the data through two services: our public Globus.org endpoint and our webshare at buzsakilab.nyumc.org. A subset of the datasets is also available at CRCNS.org. If you are interested in a dataset that is not listed or is lacking information, please contact us. We pledge to make our data available immediately after publication.
Support
For support, please use our Buzsaki Databank Google group. If you need more details on a given dataset, if a dataset is missing, or if a listed dataset is lacking information, please send us a request.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About the NUDA Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.
General
This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset consists of text, namely biased sentences with binary bias labels (processed: biased or not biased), as well as metadata about the articles. It includes all feedback that was given; the single unprocessed ratings used to create the labels, together with the correlating user IDs, are also included.
For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering, as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor can it be considered sensitive to them, and it is not possible to identify individuals.
Description of the Data Files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:
- NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
- Statistics.png: contains all Umami statistics for NewsUnravel's usage data
- Feedback.csv: holds the participant ID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
- Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of the rated sentence, the bias rating, and the reason, if given
- Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
- Participant.csv: holds the participant IDs and data-processing consent
Collection Process
Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
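As an illustration of that label-generation step, a minimal pandas sketch, assuming `Content.csv` carries one row per individual rating with the columns described above (exact column names such as `rating` are assumptions):
import pandas as pd

# One row per individual rating (column names are assumptions)
ratings = pd.read_csv('Content.csv')

# Majority vote per sentence: a sentence is labeled biased when more than
# half of its ratings mark it as biased (rating assumed to be 0/1)
labels = (
    ratings.groupby('contentId')['rating']
    .mean()
    .gt(0.5)
    .astype(int)
    .rename('bias_label')
    .reset_index()
)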
Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.
So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face. The dataset will be open source; on acceptance, a link with all details and contact information will be provided. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
EN: The dataset is based on tables with detailed data for municipalities and boroughs from the population census and the occupational census of the Netherlands of 1947. These detailed tables from the archive of Statistics Netherlands have never been published. They are written on so-called ‘transparanten’, sheets in A4 format. The set contains more than 35 table types, some of which are spread over two or more sheets, and some combined on one sheet.
Image scans of the detailed tables were made in February 2005. Those scans, 29,489 in total, were published on www.volkstellingen.nl, ordered by province and municipality. At a later stage the scans were converted by data entry into Excel worksheets. In most cases one scan was converted to one Excel file; however, if a scan contains two or more tables, a separate Excel file was made for each table. The Excel files have also been converted to CSV text files.
The thematic collection ‘12th Population Census 31 May 1947’ contains 11 datasets for the provinces plus one dataset for the Netherlands as a whole. The documentation for each dataset in the collection contains a description of the contents of all table types and the instructions given for data entry.
This dataset concerns the files for the Netherlands as a whole. The files are grouped by province.
The metadata per file (details) contains the table number. An overview of the table numbers per file is given in ‘Table number per scan_Nederland.csv’; this applies to the scans as well as to the Excel files and the CSV text files. The file ‘Titles of Tables’ shows the table numbers with the corresponding table titles.
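To navigate the collection programmatically, the overview file can be loaded directly. A minimal sketch, assuming a comma-separated layout and illustrative column names (neither is documented here):
import pandas as pd

# Map each scan/file to its table number (layout and column names assumed)
overview = pd.read_csv('Table number per scan_Nederland.csv')
print(overview.head())

# e.g., select all files containing a given table type (column name assumed)
# files_for_table_3 = overview[overview['table_number'] == 3]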
NL: The dataset is based on detailed tables at the municipal and borough level from the 1947 population and occupational censuses. These detailed tables from the CBS archive have never been published. They were written on parchment-like paper (‘transparanten’) in A4 format. The set comprises more than 35 table types, some with one table per sheet, some with a table spread over two or more sheets (depending on the size of the municipality), and a few with two or three tables on one sheet.
Image scans of these detailed tables were made in JPEG format in February 2005 during the Landelijke Contactdag Document Management. The 29,489 scans were initially published on the website www.volkstellingen.nl, ordered by province and municipality. Later, the scans were converted by data entry into Excel files. In principle, one Excel file was made from each scan; only when a scan contains two or more tables was a separate Excel file made for each table. The Excel files were also converted to CSV text files.
The collection of datasets ‘Volks- en Beroepentellingen 1947’ consists of 11 datasets for the provinces plus one dataset for the Netherlands as a whole. The documentation for all datasets in this collection includes, among other things, a description of the contents of each table type and the instructions given for data entry.
This dataset concerns the files for the Netherlands as a whole. The files are grouped by province.
The metadata per file (details) contains the table number. An overview of the table number per file is given in ‘Table number per scan_Nederland.csv’; this also applies to the corresponding Excel files and CSV text files. The file ‘Titles of Tables’ gives an overview of the table numbers with the corresponding table names; it is available as a PDF document and as a CSV text file.
12de volkstelling 31 mei 1947 - Nederland