78 datasets found

Gender Recognition by Voice(processed)
kaggle.com
Updated Jan 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
murtadha najim (2025). Gender Recognition by Voice(processed) [Dataset]. https://www.kaggle.com/datasets/murtadhanajim/vocal-gender-features
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 18, 2025
Dataset provided by
Kaggle
Authors
murtadha najim
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset is a cleaned and processed version of raw audio files for gender classification. The features were extracted from .wav audio recordings collected in a quiet room with no background noise. The data contains no null or duplicate values, ensuring a high-quality starting point for analysis and modeling.

Features:

The dataset includes the following extracted audio features:

mean_spectral_centroid: The average spectral centroid, representing the "center of mass" of the spectrum, indicating brightness. std_spectral_centroid: The standard deviation of the spectral centroid, measuring variability in brightness. mean_spectral_bandwidth: The average width of the spectrum, reflecting how spread out the frequencies are. std_spectral_bandwidth: The standard deviation of spectral bandwidth, indicating variability in frequency spread. mean_spectral_contrast: The average difference between peaks and valleys in the spectrum, indicating tonal contrast. mean_spectral_flatness: The average flatness of the spectrum, measuring the noisiness of the signal. mean_spectral_rolloff: The average frequency below which a specified percentage of the spectral energy resides, indicating sharpness. zero_crossing_rate: The rate at which the signal crosses the zero amplitude axis, representing noisiness or percussiveness. rms_energy: The root mean square energy of the signal, reflecting its loudness. mean_pitch: The average pitch frequency of the audio. min_pitch: The minimum pitch frequency. max_pitch: The maximum pitch frequency. std_pitch: The standard deviation of pitch frequency, measuring variability in pitch. spectral_skew: The skewness of the spectral distribution, indicating asymmetry. spectral_kurtosis: The kurtosis of the spectral distribution, indicating the peakiness of the spectrum. energy_entropy: The entropy of the signal energy, representing its randomness. log_energy: The logarithmic energy of the signal, a compressed representation of energy. mfcc_1_mean to mfcc_13_mean: The mean of the first 13 Mel Frequency Cepstral Coefficients (MFCCs), representing the timbral characteristics of the audio. mfcc_1_std to mfcc_13_std: The standard deviation of the first 13 MFCCs, indicating variability in timbral features. label: The target variable indicating the gender male(1) or female(0).

Key Information:

Clean Data: The dataset has been thoroughly cleaned and contains no null or duplicate values. Unscaled: The features are not scaled, allowing users to apply their preferred scaling or normalization techniques. Feature Extraction: The function used for feature extraction is available in the notebook in the Code section. High Performance: The data achieved 95%+ accuracy using machine learning models such as Random Forest, Extra Trees, and K-Nearest Neighbors (KNN). It also performed exceptionally well with neural networks.

Recommendations:

Feature Selection: Avoid using all features in modeling to prevent overfitting. Instead, perform feature selection and choose the most impactful features based on your analysis.

This processed dataset is a reliable and robust foundation for building high-performing models. and if you need any help, you can visit my notebook
d
Chlorophyll-a Standard Deviation of Long-Term Mean, 2002-2013 - Hawaii
catalog.data.gov
data.ioos.us
+1more
Updated Jan 27, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Center for Ecological Analysis and Synthesis (NCEAS) (Point of Contact) (2025). Chlorophyll-a Standard Deviation of Long-Term Mean, 2002-2013 - Hawaii [Dataset]. https://catalog.data.gov/dataset/chlorophyll-a-standard-deviation-of-long-term-mean-2002-2013-hawaii
Explore at:
Dataset updated
Jan 27, 2025
Dataset provided by
National Center for Ecological Analysis and Synthesis (NCEAS) (Point of Contact)
Area covered
Hawaii
Description
Chlorophyll-a is a widely used proxy for phytoplankton biomass and an indicator for changes in phytoplankton production. As an essential source of energy in the marine environment, the extent and availability of phytoplankton biomass can be highly influential for fisheries production and dictate trophic structure in marine ecosystems. Changes in phytoplankton biomass are predominantly effected by changes in nutrient availability, through either natural (e.g., turbulent ocean mixing) or anthropogenic (e.g., agricultural runoff) processes. This layer represents the standard deviation of the 8-day time series of chlorophyll-a (mg/m3) from 2002-2013. Monthly and 8-day 4-km (0.0417-degree) spatial resolution data were obtained from the MODIS (Moderate-resolution Imaging Spectroradiometer) Aqua satellite instrument from the NASA OceanColor website (http://oceancolor.gsfc.nasa.gov). The standard deviation was calculated over all 8-day chlorophyll-a data from 2002-2013 for each pixel. A quality control mask was applied to remove spurious data associated with shallow water, following Gove et al., 2013. Nearshore map pixels with no data were filled with values from the nearest neighboring valid offshore pixel by using a grid of points and the Near Analysis tool in ArcGIS then converting points to raster.
Data from: Polonium-210 and Lead-210 activities measured on 17 water bottle...
search.datacite.org
doi.pangaea.de
+1more
Updated 2011
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jana Friedrich (2011). Polonium-210 and Lead-210 activities measured on 17 water bottle profiles and 50 surface water samples during POLARSTERN cruise ARK-XXII/2 [Dataset]. http://doi.org/10.1594/pangaea.763937
Explore at:
Unique identifier
https://doi.org/10.1594/pangaea.763937
Dataset updated
2011
Dataset provided by
DataCitehttps://www.datacite.org/
PANGAEA - Data Publisher for Earth & Environmental Science
Authors
Jana Friedrich
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Area covered

Description
Polonium-210 and Lead-210 have been measured in the water column and on suspended particulate matter during the POLARSTERN cruise ARK-XXII/2. The data have been submitted to Pangaea following a Polonium-Lead intercalibration exercise organized by GEOTRACES, where the AWI lab results range within the data standard deviation from 10 participating labs.Polonium-210 and Lead-210 in the ocean can be used to identify the sources and sinks of suspended matter. In seawater, Polonium-210 (210Po) and Lead-210 (210Pb) are produced by stepwise radioactive decay of Uranium-238. 210Po (138 days half life) and 210Pb (22.3 years half life) have high affinities for suspended particles. Those radionuclides are present in dissolved form and adsorbed onto particles. Following adsorption onto particle surfaces, 210Po especially is transported into the interior of cells where it bonds to proteins. In this way, 210Po also accumulates in the food chain. 210Po is therefore considered to be a good tracer for POC, and traces particle export over a timescale of month. 210Pb (22.3 years half life) adsorbs preferably onto structural components of cells, biogenic silica and lithogenic particles, and is therefore a better tracer more rapidly sinking matter.Our goal during ARK XXII/2 was to trace pathways of particulate and dissolved matter leaving the Siberian Shelf. The pathways of particulate and dissolved matter will be followed by the combined use of 210Po and 234Th as a tracer pair (and perhaps 210Pb) for particle flux (Cai, P.; Rutgers van der Loeff, MM (2008) doi:10.1594/PANGAEA.708354). This information gathered from the water column will be complemented with the results of the 210Po-210Pb study in sea ice (Camara-Mor, P, Instituto de Ciencias del Mar-SCIC, Barcelona, Spain) to provide a more thorough picture of particle transport from the shelf to the open sea and from surface to depth.
Pakistan House Price dataset
kaggle.com
zip
Updated May 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jillani SofTech (2023). Pakistan House Price dataset [Dataset]. https://www.kaggle.com/datasets/jillanisofttech/pakistan-house-price-dataset/versions/1
Explore at:
zip(8379623 bytes)Available download formats
Dataset updated
May 6, 2023
Authors
Jillani SofTech
Area covered
Pakistan
Description
Dataset Description: The dataset contains information about properties. Each property has a unique property ID and is associated with a location ID based on the subcategory of the city. The dataset includes the following attributes:

Property ID: Unique identifier for each property. Location ID: Unique identifier for each location within a city. Page URL: The URL of the webpage where the property was published. Property Type: Categorization of the property into six types: House, FarmHouse, Upper Portion, Lower Portion, Flat, or Room. Price: The price of the property, which is the dependent feature in this dataset. City: The city where the property is located. The dataset includes five cities: Lahore, Karachi, Faisalabad, Rawalpindi, and Islamabad. Province: The state or province where the city is located. Location: Different types of locations within each city. Latitude and Longitude: Geographic coordinates of the cities. Steps Involved in the Analysis:

Statistical Analysis:

Data Types: Determine the data types of the attributes. Level of Measurement: Identify the level of measurement for each attribute. Summary Statistics: Calculate mean, standard deviation, minimum, and maximum values for numerical attributes. Data Cleaning:

Filling Null Values: Handle missing values in the dataset. Duplicate Values: Remove duplicate records, if any. Correcting Data Types: Ensure the correct data types for each attribute. Outliers Detection: Identify and handle outliers in the data. Exploratory Data Analysis (EDA):

Visualization: Use libraries such as Seaborn, Matplotlib, and Plotly to visualize the data and gain insights. Model Building:

Libraries: Utilize libraries like Sklearn and pickle. List of Models: Build models using Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), XG Boost, Gradient Boost, and Ada Boost. Model Saving: Save the selected model into a pickle file for future use. I hope this captures the essence of the provided information. Let me know if you need any further assistance!
Data from: (Table 3) Contribution of different C species and eCA to...
search.datacite.org
doi.pangaea.de
Updated 2011
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PANGAEA (2011). (Table 3) Contribution of different C species and eCA to phytoplanktonic DIC uptake in water samples obtained during POLARSTERN cruise ANT-XXIV/3 [Dataset]. http://doi.org/10.1594/pangaea.817742
Explore at:
Unique identifier
https://doi.org/10.1594/pangaea.817742
Dataset updated
2011
Dataset provided by
DataCitehttps://www.datacite.org/
PANGAEA
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Area covered

Description
The SE's for the contribution of eCA were calculated from the SE's for HCO3- contributions determined from control and eCA inhibited experiments using SE propagation.
N
Median Household Income Variation by Family Size in Clear Lake Township,...
neilsberg.com
csv, json
Updated Jan 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2024). Median Household Income Variation by Family Size in Clear Lake Township, Minnesota: Comparative analysis across 7 household sizes [Dataset]. https://www.neilsberg.com/research/datasets/1ac88126-73fd-11ee-949f-3860777c1fe6/
Explore at:
csv, jsonAvailable download formats
Dataset updated
Jan 11, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Minnesota, Clear Lake Township
Variables measured
Household size, Median Household Income
Measurement technique
The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. It delineates income distributions across 7 household sizes (mentioned above) following an initial analysis and categorization. Using this dataset, you can find out how household income varies with the size of the family unit. For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents median household incomes for various household sizes in Clear Lake Township, Minnesota, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.

Key observations

Of the 7 household sizes (1 person to 7-or-more person households) reported by the census bureau, Clear Lake township did not include 6-person households. Across the different household sizes in Clear Lake township the mean income is $146,611, and the standard deviation is $75,595. The coefficient of variation (CV) is 51.56%. This high CV indicates high relative variability, suggesting that the incomes vary significantly across different sizes of households. Please note that the U.S. Census Bureau uses $250,001 as a JAM value to report incomes of $250,000 or more. In the case of Clear Lake township, there were 1 household sizes where the JAM values were used. Thus, the numbers for the mean and standard deviation may not be entirely accurate and have a higher possibility of errors. However, to obtain an approximate estimate, we have used a value of $250,001 as the income for calculations, as reported in the datasets by the U.S. Census Bureau.

In the most recent year, 2021, The smallest household size for which the bureau reported a median household income was 1-person households, with an income of $57,018. It then further increased to $270,229 for 7-person households, the largest household size for which the bureau reported a median household income.

https://i.neilsberg.com/ch/clear-lake-township-mn-median-household-income-by-household-size.jpeg" alt="Clear Lake Township, Minnesota median household income, by household size (in 2022 inflation-adjusted dollars)">

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Household Sizes:

1-person households

2-person households

3-person households

4-person households

5-person households

6-person households

7-or-more-person households

Variables / Data Columns

Household Size: This column showcases 7 household sizes ranging from 1-person households to 7-or-more-person households (As mentioned above).

Median Household Income: Median household income, in 2022 inflation-adjusted dollars for the specific household size.

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Clear Lake township median household income. You can refer the same here
Data from: Yield Editor 2.0.7
catalog.data.gov
s.cnmilf.com
Updated Apr 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Yield Editor 2.0.7 [Dataset]. https://catalog.data.gov/dataset/yield-editor-2-0-7-fc11a
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Description
Yield Editor is a tool which allows the user to select, apply and analyze a variety of automated filters and editing techniques used to process and clean yield data. The software imports either AgLeader advanced or Greenstar text file formats, and exports data in a delimited ASCII format. Yield Editor 2.0.7 includes some of the improvements and updates that users of the software have asked to be included. It provides three major improvements over version 1.0.2. The most important of these is the inclusion of a module for automated selection of many yield filter values, as well as a couple of additional automated filter types. A legend tool has been added which allows for the viewing of multiple data streams. Finally, a command line interface language under development allows for automated batch mode processing of large yield datasets. Yield maps provide important information for developing and evaluating precision management strategies. The high-quality yield maps needed for decision-making require screening raw yield monitor datasets for errors and removing them before maps are made. To facilitate this process, we developed the Yield Editor interactive software which has been widely used by producers, consultants and researchers. Some of the most difficult and time consuming issues involved in cleaning yield maps include determination of combine delay times, and the removal of “overlapped” data, especially near end rows. Our new Yield Editor 2.0 automates these and other tasks, significantly increasing the reliability and reducing the difficulty of creating accurate yield maps. This paper describes this new software, with emphasis on the Automated Yield Cleaning Expert (AYCE) module. Application of Yield Editor 2.0 is illustrated through comparison of automated AYCE cleaning to the interactive approach available in Yield Editor 1.x. On a test set of fifty grain yield maps, AYCE cleaning was not significantly different than interactive cleaning by an expert user when examining field mean yield, yield standard deviation, and number of yield observations remaining after cleaning. Yield Editor 2.0 provides greatly improved efficiency and equivalent accuracy compared to the interactive methods available in Yield Editor 1.x. Resources in this dataset:Resource Title: Yield Editor 2.0.7. File Name: Web Page, url: https://www.ars.usda.gov/research/software/download/?softwareid=370&modecode=50-70-10-00 download page: https://www.ars.usda.gov/research/software/download/?softwareid=370&modecode=50-70-10-00
d
Data from: Nd-Sr isotopic composition of foraminifera and bulk sediments
search.dataone.org
doi.pangaea.de
Updated Jan 6, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vance, Derek; Scrivner, Adam E; Beney, Patricia; Staubwasser, Michael; Henderson, Gideon M; Slowey, Niall C (2018). Nd-Sr isotopic composition of foraminifera and bulk sediments [Dataset]. http://doi.org/10.1594/PANGAEA.839893
Explore at:
Unique identifier
https://doi.org/10.1594/PANGAEA.839893
Dataset updated
Jan 6, 2018
Dataset provided by
PANGAEA Data Publisher for Earth and Environmental Science
Authors
Vance, Derek; Scrivner, Adam E; Beney, Patricia; Staubwasser, Michael; Henderson, Gideon M; Slowey, Niall C
Time period covered
Jun 15, 1988 - Mar 1, 2002
Area covered
Description
We present new isotopic data for sedimentary planktonic foraminifera, as well as for potential water column and sedimentary sources of neodymium (Nd), which confirm that the isotopic composition of the foraminifera is the same as surface seawater and very different from deep water and sedimentary Nd. The faithfulness with which sedimentary foraminifera record the isotopic signature of surface seawater Nd is difficult to explain given their variable and high Nd/Ca ratios, ratios that are often sedimentary foraminifera, ratios that are often much higher than is plausible for direct incorporation within the calcite structure. We present further data that demonstrate a similarly large range in Nd/Ca ratios in plankton tow foraminifera, a range that may be controlled by redox conditions in the water column. Cleaning experiments reveal, in common with earlier work, that large amounts of Nd are released by cleaning with both hydrazine and diethylene triamine penta-acetic acid, but that the Nd released at each step is of surface origin. While further detailed studies are required to verify the exact location of the surface isotopic signature and the key controls on foraminiferal Nd isotope systematics, these new data place the use of planktonic foraminifera as recorders of surface water Nd isotope ratios, and thus of variations in the past supply of Nd to the oceans from the continents via weathering and erosion, on a reasonably sure footing.
Chlorophyll-a Standard Deviation of Long-Term Mean, 1998-2018 - American...
catalog.data.gov
data.ioos.us
+1more
Updated Dec 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA Pacific Islands Fisheries Science Center (PIFSC) (Point of Contact) (2024). Chlorophyll-a Standard Deviation of Long-Term Mean, 1998-2018 - American Samoa [Dataset]. https://catalog.data.gov/dataset/chlorophyll-a-standard-deviation-of-long-term-mean-1998-2018-american-samoa
Explore at:
Dataset updated
Dec 27, 2024
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Area covered
American Samoa
Description
Chlorophyll-a, is a widely used proxy for phytoplankton biomass and an indicator for changes in phytoplankton production. As an essential source of energy in the marine environment, the extent and availability of phytoplankton biomass can be highly influential for fisheries production and dictate trophic structure in marine ecosystems. Changes in phytoplankton biomass are predominantly effected by changes in nutrient availability, through either natural (e.g., turbulent ocean mixing) or anthropogenic (e.g., agricultural runoff) processes. This layer represents the standard deviation of the 8-day time series of chlorophyll-a (mg/m3) from 1998-2018. Data products generated by the Ocean Colour component of the European Space Agency (ESA) Climate Change Initiative (CCI) project. These files are 8-day 4-km composites of merged sensor products: Global Area Coverage (GAC), Local Area Coverage (LAC), MEdium Resolution Imaging Spectrometer (MERIS), Moderate Resolution Imaging Spectroradiometer (MODIS) Aqua, Ocean and Land Colour Instrument (OLCI), Sea-viewing Wide Field-of-view Sensor (SeaWiFS), and Visible Infrared Imaging Radiometer Suite (VIIRS). The standard deviation was calculated over all 8-day chlorophyll-a data from 1998-2018 for each pixel. A quality control mask was applied to remove spurious data associated with shallow water, following Gove et al., 2013. Nearshore map pixels with no data were filled with values from the nearest neighboring valid offshore pixel by using a grid of points and the Near Analysis tool in ArcGIS then converting points to raster. Data source: https://oceanwatch.pifsc.noaa.gov/erddap/griddap/esa-cci-chla-8d-v5-0.graph
Photosynthetically Active Radiation (PAR) Standard Deviation of Long-term...
catalog.data.gov
data.ioos.us
Updated Dec 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NOAA Pacific Islands Fisheries Science Center (PIFSC) (Point of Contact) (2024). Photosynthetically Active Radiation (PAR) Standard Deviation of Long-term Mean, 2003-2018 - American Samoa [Dataset]. https://catalog.data.gov/dataset/photosynthetically-active-radiation-par-standard-deviation-of-long-term-mean-2003-2018-american
Explore at:
Dataset updated
Dec 26, 2024
Dataset provided by
National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
Area covered
American Samoa
Description
Solar irradiance is one of the most important factors influencing coral reefs. As a majority of their nutrients are obtained from symbiotic photosynthesizing organisms, reef-building corals need sunlight as a fundamental source of energy. Seasonally low irradiance at high latitudes may be linked to reduced growth rates in corals and may limit reef calcification to shallower depths than that observed at lower latitudes. However, high levels of irradiance can lead to light-induced damage, production of free radicals, and in combination with increased temperatures, can exacerbate coral bleaching. Irradiance is here represented by PAR (photosynthetically active radiation), which is the spectrum of light that is important for photosynthesis. This layer represents the standard deviation of the 8-day time series of PAR (mol/m2/day) from 2003-2018. Data for PAR for the time period 2003-2018 were obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) Aqua satellite instrument from the NASA OceanColor website as 8-day 4-km composites. The standard deviation of the long-term mean of PAR was calculated by taking the standard deviation over all 8-day data from 2003-2018 for each pixel. A quality control mask was applied to remove spurious data associated with shallow water, following Gove et al., 2013. Nearshore map pixels with no data were filled with values from the nearest neighboring valid offshore pixel by using a grid of points and the Near Analysis tool in ArcGIS then converting points to raster. Data source: https://oceanwatch.pifsc.noaa.gov/erddap/griddap/aqua_par_8d_2018_0.graph
N
Median Household Income Variation by Family Size in Clear Lake, IA:...
neilsberg.com
csv, json
Updated Jan 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neilsberg Research (2024). Median Household Income Variation by Family Size in Clear Lake, IA: Comparative analysis across 7 household sizes [Dataset]. https://www.neilsberg.com/research/datasets/1ac86eba-73fd-11ee-949f-3860777c1fe6/
Explore at:
json, csvAvailable download formats
Dataset updated
Jan 11, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Iowa, Clear Lake
Variables measured
Household size, Median Household Income
Measurement technique
The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. It delineates income distributions across 7 household sizes (mentioned above) following an initial analysis and categorization. Using this dataset, you can find out how household income varies with the size of the family unit. For additional information about these estimations, please contact us via email at research@neilsberg.com
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset presents median household incomes for various household sizes in Clear Lake, IA, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.

Key observations

Of the 7 household sizes (1 person to 7-or-more person households) reported by the census bureau, Clear Lake did not include 5-person households. Across the different household sizes in Clear Lake the mean income is $129,811, and the standard deviation is $81,180. The coefficient of variation (CV) is 62.54%. This high CV indicates high relative variability, suggesting that the incomes vary significantly across different sizes of households.

In the most recent year, 2021, The smallest household size for which the bureau reported a median household income was 1-person households, with an income of $38,324. It then further increased to $170,800 for 7-person households, the largest household size for which the bureau reported a median household income.

https://i.neilsberg.com/ch/clear-lake-ia-median-household-income-by-household-size.jpeg" alt="Clear Lake, IA median household income, by household size (in 2022 inflation-adjusted dollars)">

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Household Sizes:

1-person households

2-person households

3-person households

4-person households

5-person households

6-person households

7-or-more-person households

Variables / Data Columns

Household Size: This column showcases 7 household sizes ranging from 1-person households to 7-or-more-person households (As mentioned above).

Median Household Income: Median household income, in 2022 inflation-adjusted dollars for the specific household size.

Good to know

Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presening these estimates in your research.

Custom data

If you do need custom data for any of your research project, report or presentation, you can contact our research staff at research@neilsberg.com for a feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyze and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights is made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Clear Lake median household income. You can refer the same here
Z
Dataset from: High consistency and repeatability in the breeding migrations...
data.niaid.nih.gov
zenodo.org
Updated Jun 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous (2024). Dataset from: High consistency and repeatability in the breeding migrations of a benthic shark [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11467088
Explore at:
Dataset updated
Jun 4, 2024
Authors
Anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset and scripts used for manuscript: High consistency and repeatability in the breeding migrations of a benthic shark.

Project title: High consistency and repeatability in the breeding migrations of a benthic sharkDate:23/04/2024

Folders:- 1_Raw_data - Perpendicular_Point_068151, Sanctuary_Point_068088, SST raw data, sst_nc_files, IMOS_animal_measurements, IMOS_detections, PS&Syd&JB tags, rainfall_raw, sample_size, Point_Perpendicular_2013_2019, Sanctuary_Point_2013_2019, EAC_transport- 2_Processed_data - SST (anomaly, historic_sst, mean_sst_31_years, week_1992_sst:week_2022_sst including week_2019_complete_sst) - Rain (weekly_rain, weekly_rainfall_completed) - Clean (clean, cleaned_data, cleaned_gam, cleaned_pj_data)- 3_Script_processing_data - Plots(dual_axis_plot (Fig. 1 & Fig. 4).R, period_plot (Fig. 2).R, sd_plot (Fig. 5).R, sex_plot (Fig. 3).R - cleaned_data.R, cleaned_data_gam.R, weekly_rainfall_completed.R, descriptive_stats.R, sst.R, sst_2019b.R, sst_anomaly.R- 4_Script_analyses - gam.R, gam_eac.R, glm.R, lme.R, Repeatability.R- 5_Output_doc - Plots (arrival_dual_plot_with_anomaly (Fig. 1).png, period_plot (Fig.2).png, sex_arrival_departure (Fig. 3).png, departure_dual_plot_with_anomaly (Fig. 4).png, standard deviation plot (Fig. 5).png) - Tables (gam_arrival_eac_selection_table.csv (Table S2), gam_departure_eac_selection_table (Table S5), gam_arrival_selection_table (Table. S3), gam_departure_selection_table (Table. S6), glm_arrival_selection_table, glm_departure_selection_table, lme_arrival_anova_table, lme_arrival_selection_table (Table S4), lme_departure_anova_table, lme_departure_selection_table (Table. S8))

Descriptions of scripts and files used:- cleaned_data.R: script to extract detections of sharks at Jervis Bay. Calculate arrival and departure dates over the seven breeding seasons. Add sex and length for each individual. Extract moon phase (numerical value) and period of the day from arrival and departure times. - IMOS_detections.csv: raw data file with detections of Port Jackson sharks over different sites in Australia. - IMOS_animal_measurements.csv: raw data file with morphological data of Port Jackson sharks - PS&Syd&JB tags: file with measurements and sex identification of sharks (different from IMOS, it was used to complete missing sex and length). - cleaned_data.csv: file with arrival and departure dates of the final sample size of sharks (N=49) with missing sex and length for some individuals. - clean.csv: completed file using PS&Syd&JB tags, note: tag ID 117393679 was wrongly identified as a male in IMOS and correctly identified as a female in PS&Syd&JB tags file as indicated by its large size. - cleaned_pj_data: Final data file with arrival and departure dates, sex, length, moon phase (numerical) and period of the day.

weekly_rainfall_completed.R: script to calculate average weekly rainfall and correlation between the two weather stations used (Point perpendicular and Sanctuary point). - weekly_rain.csv: file with the corresponding week number (1-28) for each date (01-06-2013 to 13-12-2019) - weekly_rainfall_completed.csv: file with week number (1-28), year (2013-2019) and weekly rainfall average completed with Sanctuary Point for week 2 of 2017 - Point_Perpendicular_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Point Perpendicular weather station - Sanctuary_Point_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Sanctuary Point weather station - IDCJAC0009_068088_2017_Data.csv: Rainfall (mm) from 01-01-2017 to 31-12-2017 at the Sanctuary Point weather station (to fill in missing value for average rainfall of week 2 of 2017)

cleaned_data_gam.R: script to calculate weekly counts of sharks to run gam models and add weekly averages of rainfall and sst anomaly - cleaned_pj_data.csv - anomaly.csv: weekly (1-28) average sst anomalies for Jervis Bay (2013-2019) - weekly_rainfall_completed.csv: weekly (1-28) average rainfall for Jervis Bay (2013-2019_ - sample_size.csv: file with the number of sharks tagged (13-49) for each year (2013-2019)

sst.R: script to extract daily and weekly sst from IMOS nc files from 01-05 until 31-12 for the following years: 1992:2022 for Jervis Bay - sst_raw_data: folder with all the raw weekly (1:28) csv files for each year (1992:2022) to fill in with sst data using the sst script - sst_nc_files: folder with all the nc files downloaded from IMOS from the last 31 years (1992-2022) at the sensor (IMOS - SRS - SST - L3S-Single Sensor - 1 day - night time – Australia). - SST: folder with the average weekly (1-28) sst data extracted from the nc files using the sst script for each of the 31 years (to calculate temperature anomaly).

sst_2019b.R: script to extract daily and weekly sst from IMOS nc file for 2019 (missing value for week 19) for Jervis Bay - week_2019_sst: weekly average sst 2019 with a missing value for week 19 - week_2019b_sst: sst data from 2019 with another sensor (IMOS – SRS – MODIS - 01 day - Ocean Colour-SST) to fill in the gap of week 19 - week_2019_complete_sst: completed average weekly sst data from the year 2019 for weeks 1-28.

sst_anomaly.R: script to calculate mean weekly sst anomaly for the study period (2013-2019) using mean historic weekly sst (1992-2022) - historic_sst.csv: mean weekly (1-28) and yearly (1992-2022) sst for Jervis Bay - mean_sst_31_years.csv: mean weekly (1-28) sst across all years (1992-2022) for Jervis Bay - anomaly.csv: mean weekly and yearly sst anomalies for the study period (2013-2019)

Descriptive_stats.R: script to calculate minimum and maximum length of sharks, mean Julian arrival and departure dates per individual per year, mean Julian arrival and departure dates per year for all sharks (Table. S10), summary of standard deviation of julian arrival dates (Table. S9) - cleaned_pj_data.csv

gam.R: script used to run the Generalized additive model for rainfall and sea surface temperature - cleaned_gam.csv

glm.R: script used to run the Generalized linear mixed models for the period of the day and moon phase - cleaned_pj_data.csv - sample_size.csv

lme.R: script used to run the Linear mixed model for sex and size - cleaned_pj_data.csv

Repeatability.R: script used to run the Repeatability for Julian arrival and Julian departure dates - cleaned_pj_data.csv
Salaries case study
kaggle.com
zip
Updated Oct 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
Explore at:
zip(13105509 bytes)Available download formats
Dataset updated
Oct 2, 2024
Authors
Shobhit Chauhan
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

Case Study: Employee Salary Analysis In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

Step 1: Data Collection and Preparation Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.). Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries. Step 2: Data Exploration and Descriptive Statistics Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries. Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department. Step 3: Analysis Using NumPy Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles. Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination. Step 4: Grouping and Aggregation Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department. Step 5: Salary Forecasting (Optional) Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings. Step 6: Insights and Recommendations Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers. Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation. Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization. Tools Used: Pandas: For data manipulation, grouping, and descriptive analysis. NumPy: For numerical operations such as percentiles and correlations. Matplotlib/Seaborn: For data visualization to highlight key patterns and trends. Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis. This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
Initial Sample of HYPERNETS Hyperspectral Surface Reflectance Measurements...
data.europa.eu
data.niaid.nih.gov
+1more
unknown
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2023). Initial Sample of HYPERNETS Hyperspectral Surface Reflectance Measurements for Satellite Validation from the Barrax Site in Spain [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8060798?locale=hr
Explore at:
unknown(2682)Available download formats
Dataset updated
Jun 19, 2023
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Spain
Description
The HYPERNETS project (www.hypernets.eu) has the overall aim to ensure that high quality in situ measurements are available to support the (VNIR/SWIR) optical Copernicus products. Therefore, it established a new autonomous hyperspectral spectroradiometer (HYPSTAR® - www.hypstar.eu) dedicated to land and water surface reflectance validation with instrument pointing capabilities. In the prototype phase, the instrument is being deployed at 24 sites covering a range of water and land types and a range of climatic and logistic conditions. This dataset provides the first published data for the Gobabeb HYPERNETS site in Barrax, Spain (BASP). It is a subset of the complete data record which consists of the best quality BASP measurements which could be used for satellite validation over the three day test deployment period. The provided NetCDF files are the L2A hypernets products with surface reflectances, their associated uncertainties and error-correlation information. The reflectance in the L2A products is the Hemispherical-directional Reflectance Factor (HDRF) defined as: HDRF = π L / E where L is the directional upwelling radiance (with field of view of 5 degrees) and E is the (hemispherical) downwelling irradiance (i.e. including both direct solar and diffuse sky irradiance). These reflectances have dimensions of wavelength and series, where each series is a set of measurements for a given geometry (combination of viewing zenith and azimuth angle). In addition to variables for wavelength and bandwidth, the files also contain variables that provide for each series the acquisition time, viewing and solar angles, number of valid scans used, and quality flags (typically no flags are set in the data provided in this dataset). These NetCDF files also contain further relevant metadata as attributes. See https://hypernets-processor.readthedocs.io/ for further info. The BASP site was a temporary installation over the period of the 20th – 22nd July 2022 during the Surface Reflectance Intercomparison eXperiment (SRIX) campaign (https://frm4veg.org/srix4veg/) at the Las Tiesas experimental farm in Barrax, Spain. This location was selected due to its typical clear skies, flat terrain, and well-managed crops. The HYPSTAR®-XR (eXtended Range) was deployed in a small corn field next to the ongoing UAV experiment. The instrument was deployed on a 3.5m high pole with a short extended boom at 1.3m height from the crops, with measurements running every 30 minutes throughout the day (UTC+2) and measuring between viewing zenith angles of 0-60 degrees. The HYPSTAR®-XR instruments deployed at each land HYPERNETS site consist of a VNIR and a SWIR sensor and autonomously collect data between 380-1700 nm at various viewing geometries and send it to a central server for quality control and processing. The VNIR sensor spans 1330 channels between 380 and 1000 nm with a FWHM of 3 nm and the SWIR sensor has 220 channels between 1000 and 1700 nm with a FWHM of 10 nm. The hypernets_processor (Goyens et al. 2021; De Vis et al. in prep.) automatically processes all this data into various products, including the L2A surface reflectance product provided here. All of the products have associated uncertainties (divided into random and systematic uncertainties, including error-correlation information) which were propagated using the CoMet toolkit (www.comet-toolkit.org). To obtain this dataset, we start from the full BASP data record and omit all the data that do not pass all of the quality checks performed as part of the hypernets_processor. In addition, an additional screening procedure was developed to remove outliers and only supply the best quality data suitable for satellite validation. To remove the outliers, a sigma-clipping method is used. First reflectances are extracted in separate 2 hour windows throughout the day (to account for BRDF differences due to different solar position) for 4 different wavelengths (500, 900, 1100 and 1600 nm). Outliers in these reflectances are then identified by iteratively calculating the mean reflectance trend with time (by binning the data per maximum 30 data points), calculating the standard deviation from this trend, and masking any data that is more than 3 standard deviations away from the trend. This process is repeated on the unmasked data until the standard deviation does not vary by more than 5% between two iterations. The masks for the 4 different wavelengths are then combined (keeping only measurements for which none of the 4 wavelengths is an outlier). The reflectances and associated uncertainties for any masked series (i.e. a geometry that is masked either by the sigma-clipping procedure or from the masks of the hypernets_processor) are replaced by NaNs. Any sequence that has more than half of its series masked is removed entirely. For BASP specifically, viewing zenith angles above 30 degrees have been removed, as well as any west-facing angles azimuth angles of 263,273 or 293 degrees) for viewi
Data from: (Table 1) Concentration of organohalogenated compounds in clean...
search.datacite.org
doi.pangaea.de
+1more
Updated 2010
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
datacite (2010). (Table 1) Concentration of organohalogenated compounds in clean and contaminated hair of polar bears (Ursus maritimus) from East Greenland [Dataset]. http://doi.org/10.1594/pangaea.811549
Explore at:
Unique identifier
https://doi.org/10.1594/pangaea.811549
Dataset updated
2010
Dataset provided by
DataCitehttps://www.datacite.org/
PANGAEA
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Area covered

Description
In this pilot study, we report on levels of persistent organohalogenated contaminants (OHCs) in hair of polar bears (Ursus maritimus) from East Greenland sampled between 1999 and 2001. To our knowledge, this is the first study on the validation of polar bear hair as a non-invasive matrix representative of concentrations and profiles in internal organs and blood plasma. Because of low sample weights (13-140 mg), only major bioaccumulative OHCs were detected above the limit of quantification: five polychlorinated biphenyl (PCB) congeners (CB 99, 138, 153, 170 and 180), one polybrominated diphenyl ether (PBDE) congener (BDE 47), oxychlordane, trans-nonachlor and ß-hexachlorocyclohexane. The PCB profile in hair was similar to that of internal tissues (i.e. adipose, liver, brain and blood), with CB 153 and 180 as the major congeners in all matrices. A gender difference was found for concentrations in hair relative to concentrations in internal tissues. Females (n = 6) were found to display negative correlations, while males (n = 5) showed positive correlations, although p-values were not found significant. These negative correlations in females may reflect seasonal OHC mobilisation from periphery adipose tissue due to, for example, lactation and fasting. The lack of significance in most correlations may be due to small sample sizes and seasonal variability of concentrations in soft tissues. Further research with larger sample weights and sizes is therefore necessary to draw more definitive conclusions on the usefulness of hair for biomonitoring OHCs in polar bears and other fur mammals.
n
Data from: Low-cost, local production of a safe and effective disinfectant...
data.niaid.nih.gov
search.dataone.org
+1more
zip
Updated Jun 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrea Naranjo-Soledad; Logan Smesrud; Siva RS Bandaru; Dana Hernandez; Meire Mehare; Sara Mahmoud; Vijay Matange; Bakul Rao; Chandana N; Paige Balcom; David Omole; Cesar Alvarez-Mejia; Varinia Lopez-Ramirez; Ashok Gadgil (2024). Low-cost, local production of a safe and effective disinfectant for resource-constrained communities [Dataset]. http://doi.org/10.5061/dryad.2547d7wz5
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.2547d7wz5
Dataset updated
Jun 21, 2024
Dataset provided by
University of California, Berkeley
VINYS Architects
Gulu University
Covenant University
Tecnológico Nacional de México
Indian Institute of Technology Bombay
Indian Institute of Technology Jodhpur
Authors
Andrea Naranjo-Soledad; Logan Smesrud; Siva RS Bandaru; Dana Hernandez; Meire Mehare; Sara Mahmoud; Vijay Matange; Bakul Rao; Chandana N; Paige Balcom; David Omole; Cesar Alvarez-Mejia; Varinia Lopez-Ramirez; Ashok Gadgil
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Improved hygiene depends on the accessibility and availability of effective disinfectant solutions. These disinfectant solutions are unavailable to many communities worldwide due to resource limitations, among other constraints. Safe and effective chlorine-based disinfectants can be produced via simple electrolysis of salt water, providing a low-cost and reliable option for on-site, local production of disinfectant solutions to improve sanitation and hygiene. This study reports on a system (herein called “Electro-Clean”) that can produce concentrated solutions of hypochlorous acid (HOCl) using readily available, low-cost materials. With just table salt, water, graphite welding rods, and a DC power supply, the Electro-Clean system can safely produce HOCl solutions (~1.5 liters) of up to 0.1% free chlorine (i.e.,1000 ppm) in less than two hours at low potential (5 V DC) and modest current (~5 A). Rigorous testing of free chlorine production and durability of the Electro-Clean system components, described here, has been verified to work in multiple locations around the world, including microbiological tests conducted in India and Mexico to confirm the biocidal efficacy of the Electro-Clean solution as a surface disinfectant. Cost estimates are provided for making HOCl locally with this method in the USA, India, and Mexico. Findings indicate that Electro-Clean is an affordable alternative to off-the-shelf commercial chlorinator systems in terms of first costs (or capital costs), and cost-competitive relative to the unit cost of the disinfectant produced. By minimizing dependence on supply chains and allowing for local production, the Electro-Clean system has the potential to improve public health by addressing the need for disinfectant solutions in resource-constrained communities. Methods We conducted chemical experiments in a laboratory setting, performing each experiment in triplicate unless otherwise specified. The dataset presented here includes the raw data from these experiments. We used Excel to record the data and calculate the average and standard deviation. The file names correspond to the figures or tables in the manuscript or the Supporting Information Appendices. Detailed descriptions of the experimental methods can be found in the main manuscript.
Correlations (above diagonal), standard deviations (diagonal) and...
plos.figshare.com
xls
Updated May 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t006
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0295726.t006
Dataset updated
May 29, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.
Detailed information of the observation datasets.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Weidong Ji; Rongfu Li; Wenfei Xue; Zhigang Cao; Hongying Yang; Qiaozhen Ning; Xiaokai Hu; Guanghong Liao (2025). Detailed information of the observation datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0317751.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0317751.t001
Dataset updated
May 9, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Weidong Ji; Rongfu Li; Wenfei Xue; Zhigang Cao; Hongying Yang; Qiaozhen Ning; Xiaokai Hu; Guanghong Liao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
There are abundant wind energy resources along the coast of China. Understanding spatial-temporal characteristics of wind speed is significant in meteorology, coastal engineering design and maritime industries. Reliable wind products such as reanalysis data, coupled with accurate wind speed measurements, are essential for elucidating the primary characteristics of the wind field. In this study, we evaluated hourly 10 m and 100 m wind speed data from the fifth-generation ECMWF atmospheric reanalysis (ERA5) by comparing it with direct wind measurements obtained from 19 wind tower located across the coastal waters of China. The results are as follows: 1) the basic statistical characteristic between ERA5 reanalysis and observed wind speeds demonstrate good consistency. However, the ERA5 tends to underestimate wind speed, particularly at high speeds during extreme conditions. 2) Compare ERA5 data with observations from each station using a frequency distribution-based score method, hourly scores of most stations are between 0.8 to 0.9. It shows the higher simulation skill in the northern region than the southern due to the influence of high-frequency typhoon in the South China Sea. 3)Distribution function parameters, mean values, variability, and wind threshold frequencies were analyzed for this ensemble of observation, providing an overall description of wind characteristics. Generally speaking, there is no clear linear relationship between scores and the other variables. On longer time scales (6–24 hours), the score and correlation between ERA5 and observations further increased, while the centered root-mean-square error (CRMSE) and standard deviation decrease. 4) Hourly wind data with a regular spatial distribution in ERA5 reanalysis provides valuable information for further detailed research on meteorology or renewable energy perspectives, but some inherent shortcomings should be considered.
Z
#PraCegoVer dataset
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jan 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Explore at:
Dataset updated
Jan 19, 2023
Dataset provided by
Institute of Computing, University of Campinas
Authors
Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
Description
Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

containing the images. The file dataset.json comprehends a list of json objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=
E
[GA01 - Dissolved Pb] - Dissolved lead data collected from the R/V Pourquoi...
erddap.bco-dmo.org
Updated Mar 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BCO-DMO (2020). [GA01 - Dissolved Pb] - Dissolved lead data collected from the R/V Pourquoi pas (GEOVIDE) in the North Atlantic, Labrador Sea (section GA01) during 2014 (Filling Gaps in the Atlantic and Pacific Pb and Pb Isotope Spatial and Temporal Evolution) [Dataset]. https://erddap.bco-dmo.org/erddap/info/bcodmo_dataset_651880/index.html
Explore at:
Dataset updated
Mar 24, 2020
Dataset provided by
Biological and Chemical Oceanographic Data Management Office (BCO-DMO)
Authors
BCO-DMO
License
https://www.bco-dmo.org/dataset/651880/licensehttps://www.bco-dmo.org/dataset/651880/license
Area covered

Variables measured
BTLNBR, CASTNO, CTDPRS, SAMPNO, STNNBR, SECT_ID, BTL_DATE, latitude, PI_SAMPNO, cruise_id, and 5 more
Description
Dissolved lead data collected from the R/V Pourquoi pas (GEOVIDE) in the North Atlantic, Labrador Sea (section GA01) during 2014 access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson acquisition_description=Sample storage bottle lids and threads were soaked overnight in 2N reagent grade HCl, then filled with 1N reagent grade HCl to be heated in an oven at 60 degrees celcius\u00a0overnight, inverted, heated for a second day, and rinsed 5X with pure distilled water.\u00a0 The bottles were then filled with trace metal clean dilute HCl (0.01N HCl) and again heated in the oven for one day on either end.\u00a0 Clean sample bottles were emptied, and double-bagged prior to rinsing and filling with sample.

As stated in the cruise report, trace metal clean seawater samples were collected using the French GEOTRACES clean rosette (General Oceanics Inc. Model 1018 Intelligent Rosette), equipped with twenty-two new 12L GO-FLO bottles (two bottles were leaking and were never deployed during the cruise). The 22 new GO-FLO bottles were initially cleaned in LEMAR laboratory following the GEOTRACES procedures (Cutter and Bruland, 2012). The rosette was deployed on a 6mm Kevlar cable with a dedicated custom designed clean winch. Immediately after recovery, GO-FLO bottles were individually covered at each end with plastic bags to minimize contamination. They were then transferred into a clean container (class-100) for sampling. On each trace metal cast, nutrient and/or salinity samples were taken to check potential leakage of the Go-Flo bottles. Prior to filtration, GO-FLO bottles were mixed manually three times. GO-FLO bottles were pressurized to less than\u00a08 psi with 0.2-um filtered N2\u00a0(Air Liquide). For Stations 1, 11, 15, 17, 19, 21, 25, 26, 29, 32 GO-FLO spigots were fitted with an acid-cleaned piece of Bev-a-Line tubing that fed into a 0.2 um capsule filters (SARTOBRAN\u00a0300, Sartorius). For all other stations (13, 34, 36, 38, 40, 42, 44, 49, 60, 64, 68, 69, 71, 77) seawater was filtered directly through paired filters (Pall Gelman Supor 0.45um polystersulfone, and Millipore mixed ester cellulose MF 5 um) mounted in Swinnex polypropylene filter holders, following the Planquette and Sherrell (2012) method. Filters were cleaned following the protocol described in Planquette and Sherrell (2012) and kept in acid-cleaned 1L LDPE bottles (Nalgene) filled with ultrapure water (Milli-Q, 18.2 megaohm/cm) until use. Subsamples were taken into acid-cleaned (see above) Nalgene HDPE bottles after a triple rinse with the sample. All samples were acidified back in the Boyle laboratory at 2mL per liter seawater (pH 2) with trace metal clean 6N HCl.

On this cruise, only the particulate samples were assigned GEOTRACES numbers. In this dataset, the dissolved Pb samples collected at the same depth (sometimes on a different cast) as the particulate samples have been assigned identifiers as \u201cSAMPNO\u201d which corresponds to the particulate GEOTRACES number. In cases where there were no corresponding particulate samples, a number was generated as \u201cPI_SAMPNO\u201d.

Upon examining the data, we observed that the sample taken from rosette position 1 (usually the near-bottom sample) was always higher in [Pb] than the sample taken immediately above that, and that the excess decreased as the cruise proceeded. The Pb isotope ratio of these samples were higher than the comparison bottles as well. A similar situation was seen for the sample taken from rosette positions 5, 20 and 21 when compared to the depth-interpolated [Pb] from the samples immediately above and below. Also, at two stations where our near-bottom sample was taken from rosette position 2, there was no [Pb] excess over the samples immediately above. We believe that this evidence points to sampler-induced contamination that was being slowly washed out during the cruise, but never completely. So we have flagged all of these analyses with a \u201c3\u201d indicating that we do not believe that these samples should be trusted as reflecting the true ocean [Pb].

In addition, we observed high [Pb] in the samples at Station 1 and very scattered Pb isotope ratios. The majority of these concentrations were far in excess of those values observed at nearby Station 11, and also the nearby USGT10-01. Discussion among other cruise participants revealed similarly anomalous data for other trace metals (e.g., Hg species). After discussion at the 2016 GEOVIDE Workshop, we came to the conclusion that this is*- evidence of GoFlo bottles not having sufficient time to \u201cclean up\u201d prior to use, and that most or all bottles from Station 1 were contaminated. We flagged all Station 1 data with a \u201c3\u201d indicating that we do not believe these values reflect the true ocean [Pb].

Samples were analyzed at least 1 month after acidification over 36 analytical sessions by a resin pre-concentration method. This method utilized the isotope-dilution ICP-MS method described in Lee et al. 2011, which includes pre-concentration on nitrilotriacetate (NTA) resin and analysis on a Fisons PQ2+ using a 400uL/min nebulizer. Briefly, samples were poured into 30mL subsample bottles. Then, triplicate 1.5mL polypropylene vials (Nalgene) were rinsed three times with the 30mL subsample.\u00a0 Each sample was pipetted (1.3mL) from the 30mL subsample to the 1.5mL vial.\u00a0 Pipettes were calibrated daily to the desired volume.\u00a0 25 ul of a 204Pb spike were added to each sample, and the pH was raised to 5.3 using a trace metal clean ammonium acetate buffer, prepared at a pH of between 7.95 and 7.98.\u00a0 2400 beads of NTA Superflow resin (Qiagen Inc., Valencia, CA) were added to the mixture, and the vials were set to shake on a shaker for 3 \u2013 6 days to allow the sample to equilibrate with the resin.\u00a0 After equilibration, the beads were centrifuged and washed 3 times with pure distilled water, using a trace metal clean siphon tip to remove the water wash from the sample vial following centrifugation.\u00a0 After the last wash, 350\u03bcl of a 0.1N solution of trace metal clean HNO3 was added to the resin to elute the metals, and the samples were set to shake on a shaker for 1 \u2013 2 days prior to analysis by ICP-MS.

NTA Superflow resin was cleaned by batch rinsing with 0.1N trace metal clean HCl for a few hours, followed by multiple washes until the pH of the solution was above 4.\u00a0 Resin was stored at 4 degrees celcius\u00a0in the dark until use, though it was allowed to equilibrate to room temperature prior to the addition to the sample.

Nalgene polypropylene (PPCO) vials were cleaned by heated submersion for 2 days at 60 degrees celcius\u00a0in 1N reagent grade HCl, followed by a bulk rinse and 4X individual rinse of each vial with pure distilled water. Each vial was then filled with trace metal clean dilute HCl (0.01N HCl) and heated in the oven at 60 degrees celcius\u00a0for one day on either end.\u00a0 Vials were kept filled until just before usage.

On each day of sample analysis, procedure blanks were determined. Replicates (12) of 300uL of an in-house standard reference material seawater (low Pb surface water) were used, where the amount of Pb in the 300uL was verified as negligible. The procedural blank over the relevant sessions for resin preconcentration method ranged from 2.2 \u2013 9.9pmol/kg, averaging 4.6 +/-\u00a01.7pmol/kg. Within a day, procedure blanks were very reproducible with an average standard deviation of 0.7pmol/kg, resulting in detection limits (3x this standard deviation) of 2.1pmol/kg. Replicate analyses of three different large-volume seawater samples (one with 11pmol/kg, another with 24pmol/kg, and a third with 38pmol/kg) indicated that the precision of the analysis is 4% or 1.6pmol/kg, whichever is larger.

Triplicate analyses of an international reference standard gave SAFe D2: 27.2 +/-\u00a01.7 pmol/kg. However, this standard run was linked into our own long- term quality control standards that are run on every analytical day to maintain long-term consistency. \u00a0

For the most part, the reported numbers are simply as calculated from the isotope dilution equation on the day of the analysis. For some analytical days, however, quality control samples indicated offsets in the blank used to correct the samples. For the upper 5 depths of Station 29, all depths of Station 40, and the deepest 2 depths of Station 42, the quality control samples indicated our blank was overcorrecting by 3.4pM, and we applied a -3.4pM correction to our Pb concentrations for that day. For the deepest 11 depths of Station 34, the quality control samples indicated our blank was overcorrecting by 10.2pM (due to contamination of the low trace metal seawater stock), and we applied a -10.2 pM correction to our Pb concentrations for that day. With these corrections, the overall internal comparability of the Pb collection should be better than 4%.

The errors associated with these Pb concentration measurements are on average 3.2% of the concentration (0.1 \u2013 4.4pmol/kg). Although there was a formal crossover station (1) that overlaps with USGT10-01 (GA-03), sample quality on the first station of GEOVIDE appears problematical making the comparison unhelpful. However, GEOVIDE station 11 (40.33 degrees North, 12.22 degrees\u00a0West) is not too far from USGT10-01 (38.325 degrees\u00a0North, 9.66 degrees\u00a0West) and makes for a reasonable comparison. It should also be noted that the MIT lab has intercalibrated Pb with other labs on the 2008 IC1 cruise, on the 2011 USGT11 (GA-03) cruise, and on the EPZT (GP-16) cruises, and maintains in-lab quality control standards for long-term data quality evaluation.

Ten percent of the samples were analyzed by Rick Kayser and the remaining ninety percent of the samples were analyzed by Cheryl Zurbrick. There was no significant difference between them for the lowest concentration large-volume seawater

Facebook

Twitter

Click to copy link

Link copied

Cite

murtadha najim (2025). Gender Recognition by Voice(processed) [Dataset]. https://www.kaggle.com/datasets/murtadhanajim/vocal-gender-features

Gender Recognition by Voice(processed)

help identifying male and female voice

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 18, 2025

Dataset provided by

Kaggle

Authors

murtadha najim

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This dataset is a cleaned and processed version of raw audio files for gender classification. The features were extracted from .wav audio recordings collected in a quiet room with no background noise. The data contains no null or duplicate values, ensuring a high-quality starting point for analysis and modeling.

Features:

The dataset includes the following extracted audio features:

mean_spectral_centroid: The average spectral centroid, representing the "center of mass" of the spectrum, indicating brightness. std_spectral_centroid: The standard deviation of the spectral centroid, measuring variability in brightness. mean_spectral_bandwidth: The average width of the spectrum, reflecting how spread out the frequencies are. std_spectral_bandwidth: The standard deviation of spectral bandwidth, indicating variability in frequency spread. mean_spectral_contrast: The average difference between peaks and valleys in the spectrum, indicating tonal contrast. mean_spectral_flatness: The average flatness of the spectrum, measuring the noisiness of the signal. mean_spectral_rolloff: The average frequency below which a specified percentage of the spectral energy resides, indicating sharpness. zero_crossing_rate: The rate at which the signal crosses the zero amplitude axis, representing noisiness or percussiveness. rms_energy: The root mean square energy of the signal, reflecting its loudness. mean_pitch: The average pitch frequency of the audio. min_pitch: The minimum pitch frequency. max_pitch: The maximum pitch frequency. std_pitch: The standard deviation of pitch frequency, measuring variability in pitch. spectral_skew: The skewness of the spectral distribution, indicating asymmetry. spectral_kurtosis: The kurtosis of the spectral distribution, indicating the peakiness of the spectrum. energy_entropy: The entropy of the signal energy, representing its randomness. log_energy: The logarithmic energy of the signal, a compressed representation of energy. mfcc_1_mean to mfcc_13_mean: The mean of the first 13 Mel Frequency Cepstral Coefficients (MFCCs), representing the timbral characteristics of the audio. mfcc_1_std to mfcc_13_std: The standard deviation of the first 13 MFCCs, indicating variability in timbral features. label: The target variable indicating the gender male(1) or female(0).

Key Information:

Clean Data: The dataset has been thoroughly cleaned and contains no null or duplicate values. Unscaled: The features are not scaled, allowing users to apply their preferred scaling or normalization techniques. Feature Extraction: The function used for feature extraction is available in the notebook in the Code section. High Performance: The data achieved 95%+ accuracy using machine learning models such as Random Forest, Extra Trees, and K-Nearest Neighbors (KNN). It also performed exceptionally well with neural networks.

Recommendations:

Feature Selection: Avoid using all features in modeling to prevent overfitting. Instead, perform feature selection and choose the most impactful features based on your analysis.

This processed dataset is a reliable and robust foundation for building high-performing models. and if you need any help, you can visit my notebook

Clear search

Close search

Google apps

Main menu

Gender Recognition by Voice(processed)

Features:

Key Information:

Recommendations:

Chlorophyll-a Standard Deviation of Long-Term Mean, 2002-2013 - Hawaii

Data from: Polonium-210 and Lead-210 activities measured on 17 water bottle...

Pakistan House Price dataset

Data from: (Table 3) Contribution of different C species and eCA to...

Median Household Income Variation by Family Size in Clear Lake Township,...

About this dataset

Content

Inspiration

Recommended for further research

Data from: Yield Editor 2.0.7

Data from: Nd-Sr isotopic composition of foraminifera and bulk sediments

Chlorophyll-a Standard Deviation of Long-Term Mean, 1998-2018 - American...

Photosynthetically Active Radiation (PAR) Standard Deviation of Long-term...

Median Household Income Variation by Family Size in Clear Lake, IA:...

About this dataset

Content

Inspiration

Recommended for further research

Dataset from: High consistency and repeatability in the breeding migrations...

Salaries case study

Initial Sample of HYPERNETS Hyperspectral Surface Reflectance Measurements...

Data from: (Table 1) Concentration of organohalogenated compounds in clean...

Data from: Low-cost, local production of a safe and effective disinfectant...

Correlations (above diagonal), standard deviations (diagonal) and...

Detailed information of the observation datasets.

#PraCegoVer dataset

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX

[GA01 - Dissolved Pb] - Dissolved lead data collected from the R/V Pourquoi...

Gender Recognition by Voice(processed)

help identifying male and female voice

Features:

Key Information:

Recommendations: