CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.
DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. AmbiguousMNIST contains additional digits with varying degrees of ambiguity. The AmbiguousMNIST test set likewise contains 60k ambiguous samples.
Additional Guidance
DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. The current AmbiguousMNIST contains 6k unique samples with 10 labels each; this multi-label dataset is flattened to 60k samples. The assumption is that ambiguous samples have multiple "valid" labels because they are ambiguous. MNIST samples are intentionally undersampled (in comparison), which benefits AL acquisition functions that can select unambiguous samples.
• Pick your initial training samples (for warm-starting Active Learning) from the MNIST half of DirtyMNIST to avoid starting training with potentially very ambiguous samples, which might add a lot of variance to your experiments.
• Pick your validation set from the MNIST half as well, for the same reason.
• Make sure that your batch acquisition size is >= 10 (probably), given that there are 10 multi-labels per sample in AmbiguousMNIST.
• By default, Gaussian noise with stddev 0.05 is added to each sample to prevent acquisition functions (in Active Learning) from cheating by discarding "duplicates".
• If you want to split AmbiguousMNIST into subsets (or DirtyMNIST within its second, ambiguous half), make sure to split at multiples of 10 to avoid cutting through a flattened multi-label sample.
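A minimal index-bookkeeping sketch of the guidance above, in Python/PyTorch, assuming the concatenated training set places the 60k MNIST samples at indices 0-59,999 and the 60k flattened AmbiguousMNIST samples at indices 60,000-119,999; the variable `dirty_mnist_train` is a hypothetical dataset object.

```python
# Sketch only: warm-start AL from the MNIST half and split the ambiguous half
# on multiples of 10. `dirty_mnist_train` is a hypothetical dataset object
# following the index layout described above.
import torch
from torch.utils.data import Subset

MNIST_HALF = range(0, 60_000)            # unambiguous samples
AMBIGUOUS_HALF = range(60_000, 120_000)  # 6k images flattened to 60k (10 labels each)

# Warm-start and validation pools drawn only from the MNIST half.
perm = torch.randperm(len(MNIST_HALF)).tolist()
warm_start_idx = perm[:20]
validation_idx = perm[20:5_020]

# Splits inside the ambiguous half should start on multiples of 10 so that a
# flattened multi-label sample is never cut in two.
first_1000_groups = [60_000 + g * 10 + k for g in range(1_000) for k in range(10)]

# warm_start_set = Subset(dirty_mnist_train, warm_start_idx)
# validation_set = Subset(dirty_mnist_train, validation_idx)
print(len(warm_start_idx), len(validation_idx), len(first_1000_groups))
```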
An unclean employee dataset can contain various types of errors, inconsistencies, and missing values that affect the accuracy and reliability of the data. Some common issues in unclean datasets include duplicate records, incomplete data, incorrect data types, spelling mistakes, inconsistent formatting, and outliers.
For example, there might be multiple entries for the same employee with slightly different spellings of their name or job title. Additionally, some rows may have missing data for certain columns such as bonus or exit date, which can make it difficult to analyze trends or make accurate predictions. Inconsistent formatting of data, such as using different date formats or capitalization conventions, can also cause confusion and errors when processing the data.
Furthermore, there may be outliers in the data, such as employees with extremely high or low salaries or ages, which can distort statistical analyses and lead to inaccurate conclusions.
Overall, an unclean employee dataset can pose significant challenges for data analysis and decision-making, highlighting the importance of cleaning and preparing data before analyzing it.
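As a hedged illustration, a short pandas sketch of typical clean-up steps for such a file follows; the file name and the exact column headers ("Full Name", "Job Title", "Salary", "Bonus", "Exit Date") are hypothetical placeholders.

```python
# Sketch of common clean-up steps: normalise text, drop duplicates, parse
# dates, count missing values, and flag salary outliers. Column names are
# hypothetical; adjust to the actual headers.
import pandas as pd

df = pd.read_csv("employee_data.csv")

# Normalise free-text columns so near-duplicate spellings collapse together.
for col in ["Full Name", "Job Title"]:
    df[col] = df[col].str.strip().str.title()

# Drop exact duplicate rows and parse dates into one consistent format.
df = df.drop_duplicates()
df["Exit Date"] = pd.to_datetime(df["Exit Date"], errors="coerce")

# Report missing values (e.g. Bonus or Exit Date) per column.
print(df.isna().sum())

# Flag salary outliers with a simple interquartile-range rule.
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_outlier"] = ~df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```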
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Nothing ever becomes real till it is experienced.
-John Keats
While we don't know the context in which John Keats said this, we are sure about its implication in data science. While you may have enjoyed and gained exposure to real-world problems in this challenge, here is another opportunity to get your hands dirty with this practice problem.
Problem Statement:
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. Hence, these will need to be treated accordingly.
Data:
We have 14,204 samples in the data set.
Variable Description
Item Identifier: A code provided for the item of sale
Item Weight: Weight of the item
Item Fat Content: A categorical column of how much fat is present in the item: ‘Low Fat’, ‘Regular’, ‘low fat’, ‘LF’, ‘reg’
Item Visibility: Numeric value for how visible the item is
Item Type: What category does the item belong to: ‘Dairy’, ‘Soft Drinks’, ‘Meat’, ‘Fruits and Vegetables’, ‘Household’, ‘Baking Goods’, ‘Snack Foods’, ‘Frozen Foods’, ‘Breakfast’, ’Health and Hygiene’, ‘Hard Drinks’, ‘Canned’, ‘Breads’, ‘Starchy Foods’, ‘Others’, ‘Seafood’.
Item MRP: The maximum retail price (MRP) of the item
Outlet Identifier: The outlet in which the item was sold. This is a categorical column
Outlet Establishment Year: The year in which the outlet was established
Outlet Size: A categorical column describing the size of the outlet: ‘Medium’, ‘High’, ‘Small’.
Outlet Location Type: A categorical column to describe the location of the outlet: ‘Tier 1’, ‘Tier 2’, ‘Tier 3’
Outlet Type: Categorical column for type of outlet: ‘Supermarket Type1’, ‘Supermarket Type2’, ‘Supermarket Type3’, ‘Grocery Store’
Item Outlet Sales: The number of sales for an item.
Evaluation Metric:
We will use the Root Mean Square Error (RMSE) value to judge your response.
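For reference, RMSE can be computed in a few lines of Python; the example numbers below are made up and only illustrate the call.

```python
# Reference implementation of the evaluation metric.
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([2097.3, 732.4, 994.7], [2100.0, 700.0, 1010.0]))
```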
Surface-water samples were collected, processed, and analyzed for organics, estrogen equivalents, and fecal indicator bacteria. Filtered organic samples were sent to the National Water Quality Laboratory in Denver, Colorado. Unfiltered estrogen equivalent samples were sent to the Organic Geochemistry Research Lab in Lawrence, Kansas, for extraction, after which they were sent to the National Fish Health Research Laboratory in Leetown, West Virginia. Bacteria samples were processed at the Central-Midwest Water Science Center Iowa City, Iowa, office. Staff collected field parameters in-situ.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Muddy Creek township population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Muddy Creek township. The dataset can be utilized to understand the population distribution of Muddy Creek township by age. For example, using this dataset, we can identify the largest age group in Muddy Creek township.
Key observations
The largest age group in Muddy Creek Township, Pennsylvania was the 15 to 19 years group, with a population of 193 (9.08%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Muddy Creek Township, Pennsylvania was the 80 to 84 years group, with a population of 11 (0.52%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Muddy Creek township Population by Age. You can refer to the same here.
Four samples from the Dirty Shame Rockshelter (35ML65) in southeast Oregon were floated to recover macrofloral remains. This site is a large shelter (approximately 60 meters long) and was excavated as a part of the 2010 University of Oregon Archaeological Field School with support from Dianne Pritchard, Vale District BLM Archaeologist. Samples reflect fill from a pole and thatch structure (Feature 1), as well as sediments from levels within the shelter. Radiocarbon dates of 1140 ± 95 BP and 1175 ± 70 BP were previously obtained from Feature 1. A basket fragment from Level 19 of Unit 1 yielded a radiocarbon date of 2685 ± 20 BP, while a date of 2980 ± 20 BP was returned for a basket fragment from Level 22 of Unit 2. These dates suggest multiple occupations in the shelter. Macrofloral analysis was used to provide information concerning plant resources utilized by the shelter occupants.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.
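A small Python sketch of how the '#' convention can be used to separate transcriptions with and without crossed-out content; the root folder path is a hypothetical local download location.

```python
# Sketch: index which transcription files contain crossed-out content
# (marked with '#'). The root folder name is hypothetical.
from pathlib import Path

root = Path("handwritten-dataset")
with_crossouts, without_crossouts = [], []

for txt in sorted(root.rglob("*.txt")):
    text = txt.read_text(encoding="utf-8", errors="ignore")
    (with_crossouts if "#" in text else without_crossouts).append(str(txt))

print(len(with_crossouts), "transcriptions contain crossed-out content")
print(len(without_crossouts), "transcriptions do not")
```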
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.
In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called this the essay-based dataset. It was collected from 250 high school students, who were given 30 minutes to think about the topic and write.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called this the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After finishing the lesson in almost 10 minutes, we asked students to recheck their notes and compare them with their classmates'. We did not impose any time restrictions for rechecking. We observed more cross-outs and corrections in class notes compared to the summary-based and academic-based collections.
In all four exercises, we did not impose any rules on the writers regarding, for example, spacing or pen usage. We asked them to cross out text that seemed inappropriate. Although writers usually made corrections in a second read, we also gave an extra 5 minutes for corrections.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains data files for 3 activities described in the accompanying PPT slides:
1. An Excel spreadsheet for analysing gain scores in a 2-group, 2-time data array. This activity requires access to https://campbellcollaboration.org/research-resources/effect-size-calculator.html to calculate effect size.
2. An AMOS path model and SPSS data set for an autoregressive, bivariate path model with cross-lagging. This activity is related to the following article: Brown, G. T. L., & Marshall, J. C. (2012). The impact of training students how to write introductions for academic essays: An exploratory, longitudinal study. Assessment & Evaluation in Higher Education, 37(6), 653-670. doi:10.1080/02602938.2011.563277
3. An AMOS latent curve model and SPSS data set for a 3-time latent factor model with an interaction mixed model that uses GPA as a predictor of the LCM start and slope (change) factors. This activity makes use of data reported previously and a published data analysis case: Peterson, E. R., Brown, G. T. L., & Jun, M. C. (2015). Achievement emotions in higher education: A diary study exploring emotions across an assessment event. Contemporary Educational Psychology, 42, 82-96. doi:10.1016/j.cedpsych.2015.05.002 and Brown, G. T. L., & Peterson, E. R. (2018). Evaluating repeated diary study responses: Latent curve modeling. In SAGE Research Methods Cases Part 2. Retrieved from http://methods.sagepub.com/case/evaluating-repeated-diary-study-responses-latent-curve-modeling doi:10.4135/9781526431592
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Agroecosystem management influences ecological interactions that underpin ecosystem services. In human-centered systems, people’s values and preferences influence management decisions. For example, aesthetic preferences for ‘tidy’ agroecosystems may remove vegetation complexity with potential negative impacts on beneficial associated biodiversity and ecosystem function. This may produce trade-offs in aesthetic- versus production-based management for ecosystem service provision. Yet, it is unclear how such preferences influence the ecology of small-scale urban agroecosystems, where aesthetic preferences for ‘tidiness’ are prominent among some gardener demographics. We used urban community gardens as a model system to experimentally test how aesthetic preferences for a ‘tidy garden’ versus a ‘messy garden’ influence insect pests, natural enemies, and pest control services. We manipulated gardens by mimicking a popular ‘tidy’ management practice – woodchip mulching – on the one hand, and simulating ‘messy’ gardens by adding ‘weedy’ plants to pathways on the other hand. Then, we measured for differences in natural enemy biodiversity (abundance, richness, community composition), and sentinel pest removal as a result of the tidy/messy manipulation. In addition, we measured vegetation and ground cover features of the garden system as measures of practices already in place. The tidy/messy manipulation did not significantly alter natural enemy or herbivore abundance within garden plots. The manipulation did, however, produce different compositions of natural enemy communities before and after the manipulation. Furthermore, the manipulation did affect short term gains and losses in predation services: the messy manipulation immediately lowered aphid pest removal compared to the tidy manipulation, while mulch already present in the system lowered Lepidoptera egg removal. Aesthetic preferences for ‘tidy’ green spaces often dominate urban landscapes. Yet, in urban food production systems, such aesthetic values and management preferences may create a fundamental tension in the provision of ecosystem services that support sustainable urban agriculture. Though human preferences may be hard to change, we suggest that gardeners allow some ‘messiness’ in their garden plots as a “lazy gardener” approach may promote particular natural enemy assemblages and may have no downsides to natural predation services.
A large spill of wastewater from oil and gas operations was discovered adjacent to Blacktail Creek near Williston, North Dakota in January 2015. To determine the effects of this spill on streambed microbial communities over time, bed sediment samples were taken from Blacktail Creek upstream, adjacent to, and at several locations downstream from the spill site. Blacktail Creek is a tributary of the Little Muddy River, and additional samples were taken upstream and downstream from the confluence of Blacktail Creek and the Little Muddy River. Samples were collected in February 2015, June 2015, June 2016, and June 2017. DNA was extracted from these sediments, and sequencing of the 16S ribosomal RNA gene was performed to enable analysis of the microbial community structure. Raw sequence data was processed, and taxonomy was assigned based on the Silva 132 database (Yilmaz et al, 2014) using the MOTHUR software package (Schloss et al, 2009). Raw sequence data are available from GenBank at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA666160.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Muddy by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Muddy. The dataset can be utilized to understand the population distribution of Muddy by gender and age. For example, using this dataset, we can identify the largest age group for both Men and Women in Muddy. Additionally, it can be used to see how the gender ratio changes from birth to the senior-most age group, and how the male-to-female ratio varies across each age group, for Muddy.
Key observations
Largest age group (population): Male # 65-69 years (6) | Female # 55-59 years (13). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Muddy Population by Gender. You can refer to the same here.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This model archive summary documents the suspended-sediment concentration (SSC) model developed to estimate 15-minute SSC at Muddy Creek above Paonia Reservoir, U.S. Geological Survey (USGS) site number 385903107210800. The methods used follow USGS guidance as referenced in relevant Office of Surface Water Technical Memorandum (TM) 2016.07 and Office of Water Quality TM 2016.10, and USGS Techniques and Methods, book 3, chap. C5 (Landers and others, 2016). A total of 438 suspended-sediment samples were collected during the calibration period. Forty-one of these samples (22 equal-width-interval [EWI] samples and 19 single-point pump samples) were used in the model calibration dataset. These 41 samples were collected over the range of observed streamflow, Sediment Corrected Backscatter (SCB), and Sediment Attenuation Coefficient (SAC) conditions. Samples used in calibration were plotted on duration curve plots for streamflow from March 2005 to November 2016 (Colorado Division of Wat ...
This is the dataset for the paper "Disjoint-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disorganized Texts". It includes two sub-datasets converted from CNN/DailyMail (D-CnnDM.zip) and WikiHow (D-WikiHow.zip). We include the data with training, validation, and test splits. The files for training the summarization model are WikiHowSep.zip and CnnDM.zip. We also include the small-scale data for D-WikiHow used for prompting experiments (D-WikiHow-sample). The generated summaries for all baselines, for further research and especially for human evaluation, are included (result.zip).
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
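As an illustration only (eLAB itself is written in R), the key-value remapping idea can be sketched in a few lines of Python; the data-dictionary code "potassium" and the small example data frame are hypothetical, and the real lookup table ships with the pipeline.

```python
# Hypothetical sketch of key-value remapping of lab subtypes to one DD code.
import pandas as pd

lookup = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}

labs = pd.DataFrame({
    "lab_name": ["Potassium(POC)", "Potassium,venous", "Sodium"],
    "value": [4.1, 3.9, 140.0],
})
labs["dd_code"] = labs["lab_name"].map(lookup)  # unmapped labs become NaN
labs = labs.dropna(subset=["dd_code"])          # keep only registry-defined labs
print(labs)
```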
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
In Peru, the handwashing project targets mothers/caregivers of children under five years old, and it is aimed at improving handwashing with soap practices. Children under five represent the age group most susceptible to diarrheal disease and acute respiratory infections, which are two major causes of childhood morbidity and mortality in less developed countries.
These infections, usually transferred from dirty hands to food or water sources, or by direct contact with the mouth, can be prevented if mothers/caregivers wash their hands with soap at critical times (such as before feeding a child, cooking, eating, and after using a toilet or changing a child’s diapers). In an effort to improve handwashing behavior, the intervention borrows from both commercial and social marketing fields. This entails the design of communications campaigns and messages likely to bring about the desired behavior changes, and delivering them strategically so that the target audiences are “surrounded” by handwashing promotion.
Some key elements of the intervention include: • Key behavioral concepts or triggers for each target audience • Persuasive arguments stating why and how a given concept or trigger will lead to behavior change, and • Communication ideas to convey the concepts through many integrated activities and communication channels.
The objective of the IE is to assess the effects of the project on individual-level handwashing behavior and practices of caregivers and children. By introducing exogenous variation in handwashing promotion (through randomized exposure to the project), the IE also addresses important issues related to the effect of intended behavioral change on child health and development outcomes. In particular, it provides information on the extent to which improved handwashing behavior impacts infant health and welfare.
The sample included in the IE study is not representative of the Peruvian population at the national level because the selection of provinces and districts was random and not weighted by population, as would be necessary to be geographically representative. Because populations differ across provinces and districts, the three-stage sampling design introduced a type of bias (with respect to geographical representativeness) because selection probabilities varied across administrative units.
Sample survey data [ssd]
The primary objective of the project is to improve the health and welfare of young children. The sample size (total number of households) was chosen to capture a minimum effect size of 20 percent on the key outcome indicator of diarrhea prevalence among children under two years old at the time of the baseline. The selection of households with children in this age group was made under the assumption that health outcome measurements for young children in this age range are most sensitive to changes in hygiene in the environment. Data was collected for household members of all age ranges and the corresponding data analysis was conducted for older children and adults as well. Power calculations indicated that, in order to capture a 20 percent reduction in diarrhea incidence, around 600 households per treatment arm would need to be surveyed. Therefore, since the evaluation consists of three treatment groups and two control groups, the final sample incorporates approximately 3,000 households, each with children less than two years of age at the time the survey was conducted. An additional 500 households were added to the sample size in order to address potential attrition (loss of participants during the project); thus the minimal necessary sample size was 3,500 households (around 700 households per arm).
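As a rough, hedged illustration of this kind of power calculation (not the IE team's actual computation), the snippet below uses statsmodels with a hypothetical baseline diarrhea prevalence of 25 percent; with these made-up inputs the result lands in the same ballpark as the roughly 600 households per arm cited above, though the real calculation may have used different inputs (for example clustering adjustments or a one-sided test).

```python
# Rough illustration only; the baseline prevalence (0.25) is a placeholder.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.25                 # assumed baseline diarrhea prevalence
target = baseline * (1 - 0.20)  # 20 percent relative reduction
effect = proportion_effectsize(baseline, target)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(round(n_per_arm), "households per arm before the attrition allowance")
```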
To select the sample, the IE team used a three-stage sampling methodology: • Stage 1: Province Level
From 195 total provinces in Peru, Pisco and Lima were excluded at the request of the implementation team. Of the remaining 193 provinces, 80 provinces were randomly chosen. Out of these 80 provinces, two groups of 40 provinces each were randomly formed: Group of Provinces 1 (GP1) and Group of Provinces 2 (GP2).
• Stage 2: District Level
In order to assess the impact of each of the components of the project in the health of children younger than five years old, the evaluation study has two main treatments, that is, one per component. These are the Mass Media Treatment at the provincial level, also referred to as Treatment 1 (T1), and the Social Mobilization Treatment at the district level, also referred to as Treatment 2 (T2). In order to evaluate and identify the health impacts of each component, a counterfactual to T1 and T2 is needed, which we refer to as the Control (C). The three groups, T1, T2, and C include households with children under two years old at the time of the baseline.
Out of the first group of 40 provinces, GP1, 40 districts with between 1,500 and 100,000 inhabitants were randomly chosen to receive T1. From the second group, GP2, 80 districts with between 1,500 and 100,000 inhabitants were selected randomly; 40 of them were randomly assigned to receive T2, and the other 40 districts serve as C to T1 and T2.
• Stage 3: Household Level
For each of the three sets of 40 districts (120 districts total) allocated to T1, T2, and C, 15-20 households with children under two years of age were selected at random in each district. Also, in each of the 40 districts
Face-to-face [f2f]
The following instruments were used to collect the data: • Household questionnaire: The household questionnaire was conducted in all households and was designed to collect data on household membership, education, labor, income, assets, dwelling characteristics, water sources, drinking water, sanitation, observations of handwashing facilities and other dwelling characteristics, handwashing behavior, child discipline, maternal depression, handwashing determinants, exposure to health interventions, relationship between family and school, and mortality.
• Health questionnaire: The health questionnaire was conducted in all households and designed to collect data on children’s diarrhea prevalence, ALRI and other health symptoms, child development, child growth, and anemia.
• Community questionnaire: The community questionnaire was conducted in 120 districts to collect data on community/districts variables.
• Structured observations: Structured observations were conducted in a subsample of 160 households to collect data on direct observation of handwashing behavior.
• Water samples: Water samples were collected in a subsample of 160 households, to identify Escherichia coli (E. coli) presence in hand rinses (mother and children), sentinel toy, and drinking water.
• Stool samples: Stool samples were collected in a subsample of 160 households to identify prevalence of parasites in children’s feces.
Baseline: The baseline survey was processed with the assistance of Sistemas Integrales in Chile. A manual for the data entry system is attached under the title: Data Entry Manual: Baseline.
Endline: Kimetrica International was contracted to design the data reduction system to be used during the endline. The data entry system was designed in CSPro (Version 4.1) using the DHS file management system as a standard for file management. Details of the system can be found in the attached manual entitled: Data Entry Manual for the Endline Survey.
The data entry system was based on a full double data entry (independent verification) of the various questionnaires. CSPro supports both dependent and independent verification (double keying) to ensure the accuracy of the data entry operation. Using independent verification, operators can key data into separate data files and use CSPro utilities to compare them and produce a report that indicates discrepancies in data entry.
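A minimal pandas sketch of the independent-verification idea (file names and the key column are hypothetical): two operators key the same questionnaires into separate files, and cell-level disagreements are extracted for adjudication.

```python
# Sketch: compare two independently keyed files and report discrepancies.
# File names and the "questionnaire_id" key are hypothetical placeholders.
import pandas as pd

first = pd.read_csv("entry_operator_1.csv").set_index("questionnaire_id").sort_index()
second = pd.read_csv("entry_operator_2.csv").set_index("questionnaire_id").sort_index()

# DataFrame.compare keeps only cells where the two keyings disagree
# (both files must share the same columns and questionnaire ids).
discrepancies = first.compare(second)
discrepancies.to_csv("keying_discrepancies.csv")
print(f"{discrepancies.shape[0]} questionnaires contain at least one keying difference")
```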
The DHS system uses a fully integrated tracking system to follow the stages in the data entry process. This includes the checking in of questionnaires; the programming of logic in what is known as a system controlled environment. System controlled applications generally place more restrictions on the data entry operator. This is typically used for complex survey applications. The behavior of these applications at data entry time has the following characteristics:
Files were processed using the unique cluster number and then concatenated after a final stage of editing and output to both SPSS and STATA.
Furthermore, attempts were made to respect the values and the naming conventions as provided in the baseline. This required using non-conventional values for "missing", such as -99. In most cases the same value sets were applied, or the WSP was alerted to such discrepancies during the questionnaire review process.
Baseline (result code, count, percent):
1  Completed interview       3,508   94.3
2  Incomplete interview         48    1.3
3  Not available                 7    0.2
4  Rescheduled interview         7    0.2
5  Nobody at home               48    1.3
6  Temporarily away             59    1.6
7  Refused to participate       44    1.2
   Total                     3,721
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On Tuesday, September 29th, from approximately 2:45 to 5:30 pm, an experiment for the Pan Trap Dataset was conducted with the assistance of my four other group members: Ashley, Adam, Katherine, and Kate. Initially, the weather was partly cloudy with drizzling rain; as time went on the rain got heavier, there was less sunlight, and there was a cool breeze. We hypothesized before gathering our sampling data that the rainy weather would influence the abundance of insects that would come out of the soil and become observable on the surface of the ground. In other words, we predicted that for most insects there would be a negative correlation between the amount of moisture gathered on the surface of the ground and the frequency at which insects would be observed.

For this dataset experiment, three different colours of Solo Bowls (white, blue, and yellow) were filled with soapy water and placed in groups. Each group, consisting of three different coloured bowls, was positioned at random places in the two distinct pre-set locations at the Danby Woodlot and the Grassland area. Three different colours were used in order to target a variety of insects that are attracted to certain colours, such as bees, which are specifically attracted to yellow when pollinating flowers; this way our data would be unbiased and would include most insects present in that environment. To ensure random sampling and further lower the chance of bias in our data collection, each group of three bowls was placed at a different random location within the Woodlot or Grassland site. Similarly, we placed the bowls in two different environmental settings, the Woodlot and the Grassland areas, in order to gather unbiased and random data; this way, a greater variety of insects present in different environmental settings could be captured.

In the end, as observed and counted, there was a greater variety of insects present in the Grassland area. Perhaps insects were able to take shelter among the heavy profusion of grass on the surface of the ground instead of the bare surface of the damp earth, which was more common in the Woodlot area. Insects can take refuge in hollow logs and between tree branches and trunks during rain, so perhaps that is why they were noticeably absent from the data samples from the Woodlot area. Overall, our hypothesis was to some extent correct with respect to the results of our data collection from the Woodlot area: not many insects were present on the damp, muddy soil of the Woodlot due to the heavy pouring rain. It is highly likely that most insects were hiding and taking shelter beneath the ground or in hollow tree trunks for their own survival.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
By Department of Energy [source]
The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.
This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relations between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV format as well as tabular format which can make it helpful for those who prefer to use programs like Excel or other statistical modeling software.
In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.
- Understand what's included: Before you start analyzing the data, read through the provided documentation so that you fully understand what is included in the datasets. Be aware of any limitations or requirements associated with each type of data point so that your results are valid and reliable when you draw conclusions from them.
- Clean up any outliers: You may need to spend some time upfront investigating suspicious outliers before using the dataset in further analyses; otherwise they can skew results later on and make complex statistical modeling more difficult, since they artificially inflate values in proportion to their magnitude (a single outlier can shift an entire model's prior distributions). Missing values should also be accounted for, since they may not be obvious at first glance when reviewing a table or a graphical representation. A short pandas sketch of these clean-up and exploration steps follows this guide.
- Exploratory data analysis: After cleaning up your dataset, do some basic exploration by visualizing summaries such as boxplots, histograms, and scatter plots. This gives an initial view of the trends that exist across regions and variables, which can inform future predictive models, and it will highlight any clear discontinuous changes over time.
- Analyze key metrics and observations: Once exploratory analyses have been carried out, post-processing steps follow, such as computing correlations among explanatory variables, performing significance tests and regression models, and imputing missing or outlier values, depending on the specific project needs at hand. Additionally, interpretation efforts based ...
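A minimal pandas sketch of the clean-then-explore steps above; the CSV file name and the assumption that the table contains numeric consumption columns are hypothetical placeholders.

```python
# Hedged sketch: flag outliers, count missing values, then explore.
import pandas as pd

df = pd.read_csv("building_energy_consumption.csv")  # hypothetical table

# Clean up: flag suspicious outliers (IQR rule) and count missing values.
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print("outliers per column:\n", outliers.sum())
print("missing values per column:\n", df.isna().sum())

# Explore: quick summaries before any modeling.
print(df.describe())
df.hist(figsize=(10, 8))  # histograms; rendering requires matplotlib
```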
- Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
- Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
- Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper presents a lightweight, flexible, extensible, machine readable and human-intelligible metadata schema that does not depend on a specific ontology. The metadata schema for metadata of data files is based on the concept of data lakes where data is stored as they are. The purpose of the schema is to enhance data interoperability. The lack of interoperability of messy socio-economic datasets that contain a mixture of structured, semi-structured, and unstructured data means that many datasets are underutilized. Adding a minimum set of rich metadata and describing new and existing data dictionaries in a standardized way goes a long way to make these high-variety datasets interoperable and reusable and hence allows timely and actionable information to be gleaned from those datasets. The presented metadata schema OIMS can help to standardize the description of metadata. The paper introduces overall concepts of metadata, discusses design principles of metadata schemes, and presents the structure and an applied example of OIMS.