71 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Messy Spreadsheet Example for Instruction

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jun 28, 2024
    Cite
    Renata Gonçalves Curty; Renata Gonçalves Curty (2024). Messy Spreadsheet Example for Instruction [Dataset]. http://doi.org/10.5281/zenodo.12586563
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Renata Gonçalves Curty; Renata Gonçalves Curty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 28, 2024
    Description

    A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.

  3. Dirty-MNIST Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 8, 2024
    Cite
    Jishnu Mukhoti; Andreas Kirsch; Joost van Amersfoort; Philip H. S. Torr; Yarin Gal (2024). Dirty-MNIST Dataset [Dataset]. https://paperswithcode.com/dataset/dirty-mnist
    Explore at:
    Dataset updated
    Jan 8, 2024
    Authors
    Jishnu Mukhoti; Andreas Kirsch; Joost van Amersfoort; Philip H. S. Torr; Yarin Gal
    Description

    DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. AmbiguousMNIST contains additional ambiguous digits with varying ambiguity. The AmbiguousMNIST test set contains 60k ambiguous samples as well.

    Additional Guidance

    DirtyMNIST is a concatenation of MNIST + AmbiguousMNIST, with 60k samples each in the training set. The current AmbiguousMNIST contains 6k unique samples with 10 labels each; this multi-label dataset gets flattened to 60k samples. The assumption is that ambiguous samples have multiple "valid" labels, as they are ambiguous. MNIST samples are intentionally undersampled (in comparison), which benefits AL acquisition functions that can select unambiguous samples.

    • Pick your initial training samples (for warm-starting Active Learning) from the MNIST half of DirtyMNIST to avoid starting training with potentially very ambiguous samples, which might add a lot of variance to your experiments.
    • Pick your validation set from the MNIST half as well, for the same reason.
    • Make sure your batch acquisition size is >= 10 (probably), given that there are 10 multi-labels per sample in Ambiguous-MNIST.
    • By default, Gaussian noise with stddev 0.05 is added to each sample to prevent acquisition functions (in Active Learning) from cheating by discarding "duplicates".
    • If you want to split Ambiguous-MNIST into subsets (or Dirty-MNIST within its second, ambiguous half), split by multiples of 10 to avoid splitting within a flattened multi-label sample.
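
The multiple-of-10 constraint can be made concrete without any ML framework. A minimal sketch (illustrative only, not the official DirtyMNIST loader): treat the flattened Ambiguous-MNIST as n entries where each run of 10 consecutive entries shares one underlying image, and only cut at multiples of 10.

```python
# Illustrative sketch: after flattening, each unique ambiguous image
# contributes 10 consecutive entries (one per multi-label).
LABELS_PER_SAMPLE = 10

def split_ambiguous(n_flat, train_fraction):
    """Split a flattened multi-label set at a multiple-of-10 boundary
    so no underlying image is divided between the two halves."""
    assert n_flat % LABELS_PER_SAMPLE == 0
    n_unique = n_flat // LABELS_PER_SAMPLE
    # Round the cut to a whole number of unique samples, then scale back up.
    cut = round(n_unique * train_fraction) * LABELS_PER_SAMPLE
    return cut, n_flat - cut

train, val = split_ambiguous(60_000, 0.8)
print(train, val)  # both sizes are multiples of 10
```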

  4. Employee Sample Data

    • kaggle.com
    Updated May 29, 2023
    Cite
    William Lucas (2023). Employee Sample Data [Dataset]. https://www.kaggle.com/datasets/williamlucas0/employee-sample-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    William Lucas
    Description

    An unclean employee dataset can contain various types of errors, inconsistencies, and missing values that affect the accuracy and reliability of the data. Some common issues in unclean datasets include duplicate records, incomplete data, incorrect data types, spelling mistakes, inconsistent formatting, and outliers.

    For example, there might be multiple entries for the same employee with slightly different spellings of their name or job title. Additionally, some rows may have missing data for certain columns such as bonus or exit date, which can make it difficult to analyze trends or make accurate predictions. Inconsistent formatting of data, such as using different date formats or capitalization conventions, can also cause confusion and errors when processing the data.

    Furthermore, there may be outliers in the data, such as employees with extremely high or low salaries or ages, which can distort statistical analyses and lead to inaccurate conclusions.

    Overall, an unclean employee dataset can pose significant challenges for data analysis and decision-making, highlighting the importance of cleaning and preparing data before analyzing it.
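
The issues listed above (duplicates, missing values, inconsistent formatting) can be screened for mechanically. A minimal standard-library sketch; the field names and toy rows are hypothetical, not taken from the Kaggle file:

```python
# Hypothetical rows illustrating common problems: a near-duplicate
# (same person, different capitalization) and a missing bonus value.
rows = [
    {"name": "Ana Diaz", "salary": 52000, "bonus": 2000},
    {"name": "ana diaz", "salary": 52000, "bonus": 2000},  # near-duplicate
    {"name": "Bo Chen",  "salary": 61000, "bonus": None},  # missing value
]

def audit(rows):
    """Return indices of duplicate rows and rows with missing fields."""
    seen, duplicates, missing = set(), [], []
    for i, r in enumerate(rows):
        key = (r["name"].strip().lower(), r["salary"])  # normalize first
        if key in seen:
            duplicates.append(i)
        seen.add(key)
        if any(v is None for v in r.values()):
            missing.append(i)
    return duplicates, missing

dups, miss = audit(rows)
print(dups, miss)  # [1] [2]
```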

  5. BigMart Retail Sales

    • data.niaid.nih.gov
    Updated May 2, 2022
    Cite
    Dataman (2022). BigMart Retail Sales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6509954
    Explore at:
    Dataset updated
    May 2, 2022
    Dataset authored and provided by
    Dataman
    License

    Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Nothing ever becomes real till it is experienced.

    -John Keats

    While we don't know the context in which John Keats said this, we are sure of its implication in data science. Having enjoyed and gained exposure to real-world problems in this challenge, here is another opportunity to get your hands dirty with this practice problem.

    Problem Statement :

    The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

    Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

    Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

    Data :

    The data set contains 14,204 samples.

    Variable Description

    Item Identifier: A code provided for the item of sale

    Item Weight: Weight of item

    Item Fat Content: A categorical column of how much fat is present in the item: ‘Low Fat’, ‘Regular’, ‘low fat’, ‘LF’, ‘reg’

    Item Visibility: Numeric value for how visible the item is

    Item Type: What category does the item belong to: ‘Dairy’, ‘Soft Drinks’, ‘Meat’, ‘Fruits and Vegetables’, ‘Household’, ‘Baking Goods’, ‘Snack Foods’, ‘Frozen Foods’, ‘Breakfast’, ’Health and Hygiene’, ‘Hard Drinks’, ‘Canned’, ‘Breads’, ‘Starchy Foods’, ‘Others’, ‘Seafood’.

    Item MRP: The MRP price of item

    Outlet Identifier: The outlet at which the item was sold; a categorical column

    Outlet Establishment Year: The year in which the outlet was established

    Outlet Size: A categorical column to explain size of outlet: ‘Medium’, ‘High’, ‘Small’.

    Outlet Location Type: A categorical column to describe the location of the outlet: ‘Tier 1’, ‘Tier 2’, ‘Tier 3’

    Outlet Type: Categorical column for type of outlet: ‘Supermarket Type1’, ‘Supermarket Type2’, ‘Supermarket Type3’, ‘Grocery Store’

    Item Outlet Sales: The number of sales for an item.
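
The Item Fat Content column deliberately mixes five spellings of two underlying categories. A typical first cleaning step is a lookup-table normalization (a sketch; mapping 'LF'/'low fat' to 'Low Fat' and 'reg' to 'Regular' is our assumption about the intended categories):

```python
# Map the five observed spellings onto two canonical categories.
# The choice of 'Low Fat' / 'Regular' as targets is our assumption.
FAT_CONTENT_MAP = {
    "Low Fat": "Low Fat", "low fat": "Low Fat", "LF": "Low Fat",
    "Regular": "Regular", "reg": "Regular",
}

def normalize_fat(value):
    """Normalize a raw fat-content label; pass unknown values through."""
    return FAT_CONTENT_MAP.get(value, value)

print([normalize_fat(v) for v in ["LF", "reg", "Low Fat"]])
```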

    Evaluation Metric:

    We will use the Root Mean Square Error value to judge your response.
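
Root Mean Square Error can be computed with the standard library alone (a generic sketch of the metric, not the competition's scoring script):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    assert len(actual) == len(predicted)
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

print(rmse([3.0, 5.0], [1.0, 5.0]))  # sqrt((4 + 0) / 2) = sqrt(2)
```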

  6. Data from: Water-quality data for a pharmaceutical study at Muddy Creek in...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Water-quality data for a pharmaceutical study at Muddy Creek in North Liberty and Coralville, Iowa, 2017-2018 [Dataset]. https://catalog.data.gov/dataset/water-quality-data-for-a-pharmaceutical-study-at-muddy-creek-in-north-liberty-and-cor-2017
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Coralville, North Liberty, Muddy Creek, Iowa
    Description

    Surface-water samples were collected, processed, and analyzed for organics, estrogen equivalents, and fecal indicator bacteria. Filtered organic samples were sent to the National Water Quality Laboratory in Denver, Colorado. Unfiltered estrogen equivalent samples were sent to the Organic Geochemistry Research Lab in Lawrence, Kansas, for extraction, after which they were sent to the National Fish Health Research Laboratory in Leetown, West Virginia. Bacteria samples were processed at the Central-Midwest Water Science Center Iowa City, Iowa, office. Staff collected field parameters in-situ.

  7. Muddy Creek Township, Pennsylvania Age Group Population Dataset: A Complete...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Muddy Creek Township, Pennsylvania Age Group Population Dataset: A Complete Breakdown of Muddy Creek township Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/4538db12-f122-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Muddy Creek Township, Pennsylvania
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. For ages between 0 and 85, we divided the data into roughly 5-year buckets; for over 85, we aggregated the data into a single group for all ages. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
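
The bucketing rule described in the measurement technique (5-year groups for ages 0-84, one open-ended group for 85 and over) can be written out directly (a sketch of the stated rule, not Neilsberg's actual code):

```python
def age_group(age):
    """Assign an age to one of the 18 groups: 5-year buckets for 0-84,
    plus a single open-ended bucket for 85 years and over."""
    if age >= 85:
        return "85 years and over"
    lo = (age // 5) * 5  # lower edge of the 5-year bucket
    return "Under 5 years" if lo == 0 else f"{lo} to {lo + 4} years"

print(age_group(3), "|", age_group(17), "|", age_group(84))
```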
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Muddy Creek township population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total Muddy Creek township population. The dataset can be utilized to understand the population distribution of Muddy Creek township by age. For example, using this dataset, we can identify the largest age group in Muddy Creek township.

    Key observations

    The largest age group in Muddy Creek Township, Pennsylvania was 15 to 19 years, with a population of 193 (9.08%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group was 80 to 84 years, with a population of 11 (0.52%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in the Muddy Creek township is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of the Muddy Creek township total population. Please note that the percentages may not total exactly 100 due to rounding.
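
The rounding caveat under "% of Total Population" is easy to reproduce: per-group percentages rounded to two decimals need not total exactly 100 (toy counts below, not the township's figures):

```python
# Toy counts (not the township's actual data) showing why rounded
# percentages need not total exactly 100.
counts = {"Under 5 years": 70, "5 to 9 years": 70, "85 years and over": 70}
total = sum(counts.values())  # 210
pct = {g: round(100 * n / total, 2) for g, n in counts.items()}  # 33.33 each
print(pct)  # the three rounded shares total 99.99, not 100
```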

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Muddy Creek township Population by Age. You can refer to the same here.

  8. MACROFLORAL ANALYSIS OF SAMPLES FROM THE DIRTY SHAME ROCKSHELTER (35ML65),...

    • search.dataone.org
    Updated Aug 3, 2016
    Cite
    Puseman, Kathryn (PaleoResearch Institute); Yost, Chad (PaleoResearch Institute) (2016). MACROFLORAL ANALYSIS OF SAMPLES FROM THE DIRTY SHAME ROCKSHELTER (35ML65), MALHEUR COUNTY, OREGON [Dataset]. http://doi.org/10.6067/XCV88915F6
    Explore at:
    Dataset updated
    Aug 3, 2016
    Dataset provided by
    the Digital Archaeological Record
    Authors
    Puseman, Kathryn (PaleoResearch Institute); Yost, Chad (PaleoResearch Institute)
    Description

    Four samples from the Dirty Shame Rockshelter (35ML65) in southeast Oregon were floated to recover macrofloral remains. This site is a large shelter (approximately 60 meters long) and was excavated as a part of the 2010 University of Oregon Archaeological Field School with support from Dianne Pritchard, Vale District BLM Archaeologist. Samples reflect fill from a pole and thatch structure (Feature 1), as well as sediments from levels within the shelter. Radiocarbon dates of 1140 ± 95 BP and 1175 ± 70 BP were previously obtained from Feature 1. A basket fragment from Level 19 of Unit 1 yielded a radiocarbon date of 2685 ± 20 BP, while a date of 2980 ± 20 BP was returned for a basket fragment from Level 22 of Unit 2. These dates suggest multiple occupations in the shelter. Macrofloral analysis was used to provide information concerning plant resources utilized by the shelter occupants.

  9. Student Messy Handwritten Dataset (SMHD)

    • researchdata.edu.au
    Updated Nov 20, 2023
    Cite
    Vic ciesielski; Ruwan Tennakoon; James Thom; Hiqmat Nisa (2023). Student Messy Handwritten Dataset (SMHD) [Dataset]. http://doi.org/10.25439/RMT.24312715.V1
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    RMIT University, Australia
    Authors
    Vic ciesielski; Ruwan Tennakoon; James Thom; Hiqmat Nisa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.

    Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.
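
Because crossed-out content is marked with '#', transcriptions containing corrections can be separated with a short scan (the file names and text below are toy examples following the naming scheme above, not actual dataset content):

```python
# Toy transcriptions; per the dataset's rules, '#' marks crossed-out text.
transcripts = {
    "summary-based-0001.txt": "The author argues # claims that tidy data helps.",
    "summary-based-0002.txt": "A clean transcription with no corrections.",
}

def with_crossouts(transcripts):
    """Names of transcription files containing the '#' cross-out marker."""
    return sorted(name for name, text in transcripts.items() if "#" in text)

print(with_crossouts(transcripts))  # ['summary-based-0001.txt']
```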

    Dataset Description:

    We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are primary sources in academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy with all kinds of insertions and corrections and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations. So, we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have lines in them. For this purpose, we drew lines using light colors on white paper. The height of a line is 1.5 pt and the space between two lines is 40 pt. The filled handwritten documents were scanned at a resolution of 300 dpi at a grey-level resolution of 8 bits.

    Collection Process: The collection process was done in four different ways. In the first exercise, we asked participants to summarize a given text in their own words. We called it a summary-based dataset. In the summary writing task, we included 60 undergraduate students studying the English language as a subject. After getting their consent, we distributed printed text articles and we asked them to choose one article, read it and summarize it in a paragraph in 15 minutes. The corpus of the printed text articles given to the participants was collected from the Internet on different topics. The articles were related to current political situations, daily life activities, and the Covid-19 pandemic.

    In the second exercise, we asked participants to write an essay from a given list of topics, or they could write on any topic of their choice. We called it an essay-based dataset. This dataset is collected from 250 High school students. We gave them 30 minutes to think about the topic and write for this task.

    In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current study. We called it a subject-based dataset. For this study, we used undergraduate students from different subjects: 33 from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.

    Finally, for the class-notes dataset, we collected class notes from almost 31 students on the same topic. We asked students to take notes of every possible sentence the speaker delivered during the lecture. After the lesson finished, in almost 10 minutes, we asked students to recheck their notes and compare them with their classmates'. We did not impose any time restriction for rechecking. We observed more cross-outs and corrections in class notes than in the summary-based and academic-based collections.

    In all four exercises, we imposed no rules on the writers (for example, spacing or usage of a pen). We asked them to cross out text that seemed inappropriate. Although writers usually made corrections on a second read, we also gave an extra 5 minutes for corrections.

  10. Repeated Measures data files

    • auckland.figshare.com
    zip
    Updated Nov 9, 2020
    Cite
    Gavin T. L. Brown (2020). Repeated Measures data files [Dataset]. http://doi.org/10.17608/k6.auckland.13211120.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 9, 2020
    Dataset provided by
    The University of Auckland
    Authors
    Gavin T. L. Brown
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This zip file contains data files for 3 activities described in the accompanying PPT slides:

    1. An Excel spreadsheet for analysing gain scores in a 2-group, 2-time data array. This activity requires access to https://campbellcollaboration.org/research-resources/effect-size-calculator.html to calculate effect size.

    2. An AMOS path model and SPSS data set for an autoregressive, bivariate path model with cross-lagging. This activity is related to the following article: Brown, G. T. L., & Marshall, J. C. (2012). The impact of training students how to write introductions for academic essays: An exploratory, longitudinal study. Assessment & Evaluation in Higher Education, 37(6), 653-670. doi:10.1080/02602938.2011.563277

    3. An AMOS latent curve model and SPSS data set for a 3-time latent factor model with an interaction mixed model that uses GPA as a predictor of the LCM start and slope (change) factors. This activity makes use of data reported previously and a published data analysis case: Peterson, E. R., Brown, G. T. L., & Jun, M. C. (2015). Achievement emotions in higher education: A diary study exploring emotions across an assessment event. Contemporary Educational Psychology, 42, 82-96. doi:10.1016/j.cedpsych.2015.05.002; and Brown, G. T. L., & Peterson, E. R. (2018). Evaluating repeated diary study responses: Latent curve modeling. In SAGE Research Methods Cases Part 2. Retrieved from http://methods.sagepub.com/case/evaluating-repeated-diary-study-responses-latent-curve-modeling doi:10.4135/9781526431592
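
The first activity computes an effect size on gain scores via the Campbell Collaboration calculator linked above. A generic gain-score Cohen's d (a standard textbook formula, not necessarily the workbook's exact computation) can be sketched as:

```python
import statistics

def gain_score_effect_size(pre_a, post_a, pre_b, post_b):
    """Cohen's d on gain scores for a 2-group, 2-time design:
    difference in mean gains divided by the pooled SD of the gains."""
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    pooled_var = ((n_a - 1) * statistics.variance(gains_a)
                  + (n_b - 1) * statistics.variance(gains_b)) / (n_a + n_b - 2)
    return (statistics.mean(gains_a) - statistics.mean(gains_b)) / pooled_var ** 0.5

# Toy scores: group A gains one point more than group B on average.
print(gain_score_effect_size([0, 0, 0, 0], [2, 2, 4, 4],
                             [0, 0, 0, 0], [1, 1, 3, 3]))
```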

  11. Data from: ‘Tidy’ and ‘messy’ management alters natural enemy communities...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Aug 29, 2022
    Cite
    Monika Egerer (2022). ‘Tidy’ and ‘messy’ management alters natural enemy communities and pest control in urban agroecosystems [Dataset]. http://doi.org/10.7291/D15H4V
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Technical University of Munich
    Authors
    Monika Egerer
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Agroecosystem management influences ecological interactions that underpin ecosystem services. In human-centered systems, people's values and preferences influence management decisions. For example, aesthetic preferences for 'tidy' agroecosystems may remove vegetation complexity, with potential negative impacts on beneficial associated biodiversity and ecosystem function. This may produce trade-offs in aesthetic- versus production-based management for ecosystem service provision. Yet it is unclear how such preferences influence the ecology of small-scale urban agroecosystems, where aesthetic preferences for 'tidiness' are prominent among some gardener demographics.

    We used urban community gardens as a model system to experimentally test how aesthetic preferences for a 'tidy garden' versus a 'messy garden' influence insect pests, natural enemies, and pest control services. We manipulated gardens by mimicking a popular 'tidy' management practice (woodchip mulching) on the one hand, and simulating 'messy' gardens by adding 'weedy' plants to pathways on the other. Then we measured differences in natural enemy biodiversity (abundance, richness, community composition) and sentinel pest removal as a result of the tidy/messy manipulation. In addition, we measured vegetation and ground cover features of the garden system as measures of practices already in place.

    The tidy/messy manipulation did not significantly alter natural enemy or herbivore abundance within garden plots. It did, however, produce different compositions of natural enemy communities before and after the manipulation. Furthermore, it affected short-term gains and losses in predation services: the messy manipulation immediately lowered aphid pest removal compared to the tidy manipulation, while mulch already present in the system lowered Lepidoptera egg removal. Aesthetic preferences for 'tidy' green spaces often dominate urban landscapes. Yet, in urban food production systems, such aesthetic values and management preferences may create a fundamental tension in the provision of ecosystem services that support sustainable urban agriculture. Though human preferences may be hard to change, we suggest that gardeners allow some 'messiness' in their garden plots, as a "lazy gardener" approach may promote particular natural enemy assemblages and may have no downsides to natural predation services.

  12. Microbial Community Composition Data from Blacktail Creek near Williston,...

    • datasets.ai
    • data.usgs.gov
    • +1more
    55
    Updated Aug 6, 2024
    + more versions
    Cite
    Department of the Interior (2024). Microbial Community Composition Data from Blacktail Creek near Williston, North Dakota [Dataset]. https://datasets.ai/datasets/microbial-community-composition-data-from-blacktail-creek-near-williston-north-dakota
    Explore at:
    Available download formats: 55
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Department of the Interior
    Area covered
    Williston, North Dakota
    Description

    A large spill of wastewater from oil and gas operations was discovered adjacent to Blacktail Creek near Williston, North Dakota in January 2015. To determine the effects of this spill on streambed microbial communities over time, bed sediment samples were taken from Blacktail Creek upstream, adjacent to, and at several locations downstream from the spill site. Blacktail Creek is a tributary of the Little Muddy River, and additional samples were taken upstream and downstream from the confluence of Blacktail Creek and the Little Muddy River. Samples were collected in February 2015, June 2015, June 2016, and June 2017. DNA was extracted from these sediments, and sequencing of the 16S ribosomal RNA gene was performed to enable analysis of the microbial community structure. Raw sequence data was processed, and taxonomy was assigned based on the Silva 132 database (Yilmaz et al, 2014) using the MOTHUR software package (Schloss et al, 2009). Raw sequence data are available from GenBank at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA666160.

  13. Muddy, IL Population Breakdown by Gender and Age Dataset: Male and Female...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    + more versions
    Cite
    Neilsberg Research (2025). Muddy, IL Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e1f38c67-f25d-11ef-8c1b-3860777c1fe6/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Illinois, Muddy
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Muddy by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Muddy. The dataset can be utilized to understand the population distribution of Muddy by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Muddy. Additionally, it can be used to see how the gender ratio changes from birth to the oldest age group, and the male-to-female ratio across each age group.

    Key observations

    Largest age group (population): Male # 65-69 years (6) | Female # 55-59 years (13). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender :

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the Muddy population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): The male population of Muddy in the given age group.
    • Population (Female): The female population of Muddy in the given age group.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in Muddy for each age group.
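    The Gender Ratio column above is a simple derived quantity. As a minimal sketch (the counts below are hypothetical, not values from the Muddy dataset), it can be computed from the two population columns:

```python
# Gender ratio = number of males per 100 females in an age group.
# The counts below are illustrative only.
def gender_ratio(males, females):
    """Males per 100 females, rounded to one decimal; None if no females."""
    if females == 0:
        return None
    return round(100 * males / females, 1)

age_groups = {
    "65 to 69 years": (6, 4),   # (male, female) -- hypothetical counts
    "55 to 59 years": (5, 13),
}
ratios = {group: gender_ratio(m, f) for group, (m, f) in age_groups.items()}
print(ratios)  # {'65 to 69 years': 150.0, '55 to 59 years': 38.5}
```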

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Muddy Population by Gender. You can refer to the same here.

  14. Model Archive Summary for Suspended-Sediment Concentration at U.S....

    • data.usgs.gov
    • gimi9.com
    Cite
    Rodney Richards; Natalie Day, Model Archive Summary for Suspended-Sediment Concentration at U.S. Geological Survey Site 385903107210800; Muddy Creek above Paonia Reservoir, Colorado [Dataset]. http://doi.org/10.5066/P9USKIRP
    Explore at:
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Authors
    Rodney Richards; Natalie Day
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    Paonia Reservoir, Colorado, Muddy Creek
    Description

    This model archive summary documents the suspended-sediment concentration (SSC) model developed to estimate 15-minute SSC at Muddy Creek above Paonia Reservoir, U.S. Geological Survey (USGS) site number 385903107210800. The methods used follow USGS guidance as referenced in relevant Office of Surface Water Technical Memorandum (TM) 2016.07 and Office of Water Quality TM 2016.10, and USGS Techniques and Methods, book 3, chap. C5 (Landers and others, 2016). A total of 438 suspended-sediment samples were collected during the calibration period. Forty-one of these samples (22 equal-width-interval [EWI] samples and 19 single-point pump samples) were used in the model calibration dataset. These 41 samples were collected over the range of observed streamflow, Sediment Corrected Backscatter (SCB), and Sediment Attenuation Coefficient (SAC) conditions. Samples used in calibration were plotted on duration curve plots for streamflow from March 2005 to November 2016 (Colorado Division of Wat ...

  15. Disjoint-DABS: A Benchmark for Dynamic Aspect-Based Summarization in...

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Guo, Xiaobo; Vosoughi, Soroush (2024). Disjoint-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disordered Texts [Dataset]. http://doi.org/10.7910/DVN/OEE1RI
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Guo, Xiaobo; Vosoughi, Soroush
    Description

    This is the dataset for the paper "Disjoint-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disorganized Texts". It includes two sub-datasets converted from CNN/DailyMail (D-CnnDM.zip) and WikiHow (D-WikiHow.zip), each with training, validation, and test splits. The files for training the summarization model are WikiHowSep.zip and CnnDM.zip. We also include the small-scale D-WikiHow data used for the prompting experiments (D-WikiHow-sample). The generated summaries for all baselines are included for further research, especially human evaluation (result.zip).

  16. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • zenodo.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names, and eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
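    The key-value remapping step described above can be sketched in a few lines. This is not the actual eLAB R code; it is a hypothetical Python illustration using only the potassium subtypes quoted in the text, and "mmol/L" is an assumed example of a DD-defined unit:

```python
# Toy illustration (not the actual eLAB R code) of the key-value lookup
# remapping: many EHR lab subtypes collapse onto one data-dictionary code.
# Only the potassium subtypes quoted in the text are shown (the real table
# has ~300 entries), and "mmol/L" is an assumed example of a DD unit.
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}
DD_UNITS = {"potassium": "mmol/L"}  # units pre-defined by the registry DD

def remap(lab_name, value, unit):
    """Return (dd_code, value), or None when the lab or unit is not DD-defined."""
    code = LAB_LOOKUP.get(lab_name)
    if code is None or DD_UNITS.get(code) != unit:
        return None  # only labs/units pre-configured by the DD are accepted
    return (code, value)

print(remap("Potassium(POC)", 4.1, "mmol/L"))  # ('potassium', 4.1)
print(remap("Potassium(POC)", 4.1, "mg/dL"))   # None
```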

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
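    Because every site shares the DD, the final aggregation step reduces to concatenating same-schema files. A toy sketch (column names and values are made up, not the registry's actual fields):

```python
import csv
import io

# Because every participating site uses the same Data Dictionary, multi-site
# aggregation reduces to concatenating same-schema files. The column names
# and values below are made up for illustration.
site_a = "record_id,potassium\n101,4.1\n102,3.9\n"
site_b = "record_id,potassium\n201,4.4\n"

def combine(*csv_texts):
    """Concatenate CSV files that share one header into a list of row dicts."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows

combined = combine(site_a, site_b)
print(len(combined))             # 3
print(combined[2]["record_id"])  # 201
```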

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
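    The overall-survival definition above can be expressed directly in code. A minimal sketch with made-up dates (not the study's actual implementation):

```python
from datetime import date

# Overall survival (OS) as defined in the text: time from MCC diagnosis
# to death, censored at the last follow-up visit when no death occurred.
# The dates below are made up for illustration.
def os_days(diagnosis, death=None, last_followup=None):
    """Return (duration_in_days, event_observed)."""
    if death is not None:
        return (death - diagnosis).days, True       # death event observed
    return (last_followup - diagnosis).days, False  # censored at follow-up

print(os_days(date(2017, 3, 1), death=date(2019, 3, 1)))          # (730, True)
print(os_days(date(2018, 1, 1), last_followup=date(2020, 1, 1)))  # (730, False)
```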

  17. WSP Global Scaling up Handwashing Behavior Impact Evaluation, Baseline and...

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    Water and Sanitation Program (2019). WSP Global Scaling up Handwashing Behavior Impact Evaluation, Baseline and Endline Surveys 2009-2011 - Peru [Dataset]. https://datacatalog.ihsn.org/catalog/4685
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Water and Sanitation Program
    Time period covered
    2008 - 2011
    Area covered
    Peru
    Description

    Abstract

    In Peru, the handwashing project targets mothers/caregivers of children under five years old, and it is aimed at improving handwashing with soap practices. Children under five represent the age group most susceptible to diarrheal disease and acute respiratory infections, which are two major causes of childhood morbidity and mortality in less developed countries.

    These infections, usually transferred from dirty hands to food or water sources, or by direct contact with the mouth, can be prevented if mothers/caregivers wash their hands with soap at critical times (such as before feeding a child, cooking, eating, and after using a toilet or changing a child’s diapers). In an effort to improve handwashing behavior, the intervention borrows from both commercial and social marketing fields. This entails the design of communications campaigns and messages likely to bring about the desired behavior changes, and delivering them strategically so that the target audiences are “surrounded” by handwashing promotion.

    Some key elements of the intervention include: • Key behavioral concepts or triggers for each target audience • Persuasive arguments stating why and how a given concept or trigger will lead to behavior change, and • Communication ideas to convey the concepts through many integrated activities and communication channels.

    The objective of the IE is to assess the effects of the project on individual-level handwashing behavior and practices of caregivers and children. By introducing exogenous variation in handwashing promotion (through randomized exposure to the project), the IE also addresses important issues related to the effect of intended behavioral change on child health and development outcomes. In particular, it provides information on the extent to which improved handwashing behavior impacts infant health and welfare.

    Geographic coverage

    The sample included in the IE study is not representative of the Peruvian population at the national level because the selection of provinces and districts was random and not weighted by population, as would be necessary to be geographically representative. Because populations differ across provinces and districts, the three-stage sampling design introduced a type of bias (with respect to geographical representativeness) because selection probabilities varied across administrative units.

    Analysis unit

    • Household
    • Person
    • Caregiver
    • Child (under 5 and under 2)

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The primary objective of the project is to improve the health and welfare of young children. The sample size (total number of households) was chosen to capture a minimum effect size of 20 percent on the key outcome indicator of diarrhea prevalence among children under two years old at the time of the baseline. The selection of households with children in this age group was made under the assumption that health outcome measurements for young children in this age range are most sensitive to changes in hygiene in the environment. Data was collected for household members of all age ranges and the corresponding data analysis was conducted for older children and adults as well. Power calculations indicated that, in order to capture a 20 percent reduction in diarrhea incidence, around 600 households per treatment arm would need to be surveyed. Therefore, since the evaluation consists of three treatment groups and two control groups, the final sample incorporates approximately 3,000 households, each with children less than two years of age at the time the survey was conducted. An additional 500 households were added to the sample size in order to address potential attrition (loss of participants during the project); thus the minimal necessary sample size was 3,500 households (around 700 households per arm).
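    The sample-size arithmetic in this paragraph can be laid out explicitly (a sketch, not project code):

```python
# Power calculations from the text: ~600 households per arm to detect a
# 20% reduction in diarrhea incidence; three treatment arms plus two
# control arms; 500 extra households budgeted for attrition.
per_arm_min = 600
arms = 3 + 2                       # three treatment groups, two control groups
base = per_arm_min * arms          # minimum sample before attrition buffer
attrition_buffer = 500
total = base + attrition_buffer
per_arm_final = total // arms
print(base, total, per_arm_final)  # 3000 3500 700
```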

    To select the sample, the IE team used a three-stage sampling methodology:

    • Stage 1: Province Level

    From the 195 total provinces in Peru, Pisco and Lima were excluded at the request of the implementation team. Of the remaining 193 provinces, 80 were randomly chosen. Out of these 80 provinces, two groups of 40 provinces each were randomly formed: Group of Provinces 1 (GP1) and Group of Provinces 2 (GP2).

    • Stage 2: District Level

    In order to assess the impact of each of the components of the project in the health of children younger than five years old, the evaluation study has two main treatments, that is, one per component. These are the Mass Media Treatment at the provincial level, also referred to as Treatment 1 (T1), and the Social Mobilization Treatment at the district level, also referred to as Treatment 2 (T2). In order to evaluate and identify the health impacts of each component, a counterfactual to T1 and T2 is needed, which we refer to as the Control (C). The three groups, T1, T2, and C include households with children under two years old at the time of the baseline.

    Out of the first group of 40 provinces, GP1, 40 districts with between 1,500 and 100,000 inhabitants were randomly chosen to receive T1. From the second group, GP2, 80 districts with between 1,500 and 100,000 inhabitants were selected randomly; 40 of them were randomly assigned to receive T2, and the other 40 districts to serve as C to T1 and T2.

    • Stage 3: Household Level

    For each of the three sets of 40 districts (120 districts total) allocated to T1, T2, and C, 15-20 households with children under two years of age were selected at random in each district. Also, in each of the 40 districts

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The following instruments were used to collect the data: • Household questionnaire: The household questionnaire was conducted in all households and was designed to collect data on household membership, education, labor, income, assets, dwelling characteristics, water sources, drinking water, sanitation, observations of handwashing facilities and other dwelling characteristics, handwashing behavior, child discipline, maternal depression, handwashing determinants, exposure to health interventions, the relationship between family and school, and mortality.

    • Health questionnaire: The health questionnaire was conducted in all households and designed to collect data on children’s diarrhea prevalence, ALRI and other health symptoms, child development, child growth, and anemia.

    • Community questionnaire: The community questionnaire was conducted in 120 districts to collect data on community/districts variables.

    • Structured observations: Structured observations were conducted in a subsample of 160 households to collect data on direct observation of handwashing behavior.

    • Water samples: Water samples were collected in a subsample of 160 households, to identify Escherichia coli (E. coli) presence in hand rinses (mother and children), sentinel toy, and drinking water.

    • Stool samples: Stool samples were collected in a subsample of 160 households to identify prevalence of parasites in children’s feces.

    Cleaning operations

    Baseline: The baseline survey was processed with the assistance of Sistemas Integrales in Chile. A manual for the data entry system is attached under the title Data Entry Manual: Baseline.

    Endline: Kimetrica International was contracted to design the data reduction system to be used during the endline. The data entry system was designed in CSPro (Version 4.1) using the DHS file management system as a standard for file management. Details of the system can be found in the attached manual entitled: Data Entry Manual for the Endline Survey.

    The data entry system was based on a full double data entry (independent verification) of the various questionnaires. CSPro supports both dependent and independent verification (double keying) to ensure the accuracy of the data entry operation. Using independent verification, operators can key data into separate data files and use CSPro utilities to compare them and produce a report that indicates discrepancies in data entry.
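    Independent verification of the kind described, with two operators keying into separate files and a utility reporting discrepancies, can be sketched as follows (CSPro does this natively; the Python below is only a hypothetical illustration with made-up records):

```python
# Hypothetical illustration of independent verification (double keying):
# two operators key the same questionnaires into separate files, and a
# comparison utility reports cells where the two passes disagree.
pass_1 = {("HH001", "age"): "34", ("HH001", "sex"): "F", ("HH002", "age"): "41"}
pass_2 = {("HH001", "age"): "34", ("HH001", "sex"): "M", ("HH002", "age"): "41"}

def discrepancies(a, b):
    """Map (record, field) -> (value_a, value_b) wherever the passes differ."""
    keys = a.keys() | b.keys()
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

print(discrepancies(pass_1, pass_2))  # {('HH001', 'sex'): ('F', 'M')}
```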

    The DHS system uses a fully integrated tracking system to follow the stages in the data entry process. This includes the checking in of questionnaires and the programming of logic in what is known as a system-controlled environment. System-controlled applications generally place more restrictions on the data entry operator and are typically used for complex survey applications. The behavior of these applications at data entry time has the following characteristics:

    • Some special data entry keys are not active during data entry.
    • CSEntry will keep track of the path.
    • 'Not applicable' or blank values will not be allowed; missing values have to be coded.
    • More appropriate to the heads-up methodology of data capture.
    • Logic in the application is strictly enforced; operator cannot bypass or override.

    Files were processed using the unique cluster number and then concatenated after a final stage of editing and output to both SPSS and STATA.

    Furthermore, attempts were made to respect the values and naming conventions provided in the baseline. This required using non-conventional values for "missing", such as -99. In most cases the same value sets were applied, or the WSP was alerted to discrepancies during the questionnaire review process.

    Response rate

    Baseline:

    • Completed interview: 3,508 (94.3%)
    • Incomplete interview: 48 (1.3%)
    • Not available: 7 (0.2%)
    • Rescheduled interview: 7 (0.2%)
    • Nobody at home: 48 (1.3%)
    • Temporarily away: 59 (1.6%)
    • Refused to participate: 44 (1.2%)

    Total

  18. Week 4: Techniques and Sampling - Pan Trap Dataset Gathering in the Danby...

    • figshare.com
    xlsx
    Updated Jan 20, 2016
    Cite
    Nina Lee (2016). Week 4: Techniques and Sampling - Pan Trap Dataset Gathering in the Danby Woodlot & Grasslands at York University [Dataset]. http://doi.org/10.6084/m9.figshare.1565577.v2
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Nina Lee
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On Tuesday, September 29th, from approximately 2:45 to 5:30 pm, an experiment for the Pan Trap Dataset was conducted with the assistance of my four other group members: Ashley, Adam, Katherine, and Kate. Initially, the weather was partly cloudy with drizzling rain; as time went on, the rain got heavier, there was less sunlight, and there was a cool breeze. As hypothesized before gathering our sampling data, the rainy weather would influence the abundance of insects that would come out of the soil and become observable on the surface of the ground. In other words, it was predicted that for most insects there would be a negative correlation between the amount of moisture gathered on the surface of the ground and the frequency at which insects would be observed. For this dataset experiment, three different colours of Solo Bowls (white, blue, and yellow) were filled with soapy water and placed in groups. Each group, consisting of three different coloured bowls, was positioned at random places in the two pre-set locations at the Danby Woodlot and the Grassland area. Three different colours were used in an attempt to target a variety of insects that are attracted to certain colours, such as bees, which are specifically attracted to yellow for pollination of flowers; this way, our data would be unbiased and would include most insects present in that environment. To ensure random sampling and further lower the chance of bias in our data collection, each group of three bowls was placed at a different random location within the Woodlot or Grassland site. Similarly, we placed the bowls in the two different environmental settings of the Woodlot and Grassland areas to gather unbiased and random data.

    This way, perhaps a greater variety of insects present in different environmental settings could be gathered. In the end, as observed and counted, there was a greater variety of insects present in the Grassland area. Perhaps insects were able to take shelter among the heavy profusion of grass on the surface of the ground, instead of the bare surface of the damp earth that was more common in the Woodlot area. Insects can take refuge in hollow logs and between tree branches and trunks during rain, so perhaps that is why they were noticeably absent in the data samples from the Woodlot area. Overall, our hypothesis was to some extent correct with respect to the results of our data collection from the Woodlot area. Not many insects were present on the damp, muddy soil of the Woodlot due to the heavy pouring rain. It is highly likely that most insects were hiding and taking shelter beneath the ground or in hollow tree trunks for their own survival.

  19. Energy Consumption of United States Over Time

    • kaggle.com
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Energy Consumption of United States Over Time [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-energy-consumption-of-united-state
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Energy Consumption of United States Over Time

    Building Energy Data Book

    By Department of Energy [source]

    About this dataset

    The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial, and industrial building energy consumption, construction techniques, building technologies, and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single family homes to large office complexes - as well as its impact on the environment. The Building Technologies Office (BTO) within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers, and everyday observers who are interested in learning more about our built environment and its energy usage patterns.


    How to use the dataset

    This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables that can be used to analyze and explore the relations between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV and tabular format, which makes it convenient for those who prefer programs like Excel or other statistical modeling software.

    In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.

    • Understand what's included: Before you start analyzing the data, read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point, so that your results are valid and reliable when drawing conclusions from them.

    • Clean up any outliers: You may need to spend some time upfront investigating suspicious outliers in your dataset before using it in any further analyses; otherwise, they can skew your results down the road. They can also make complex statistical modeling more difficult, since they artificially inflate values depending on their magnitude (a single outlier can affect an entire model's prior distributions). Missing values should also be accounted for, since they are not always obvious at first glance when reviewing a table or graphical representation, and accurate statistics must still be obtained either way.

    • Exploratory data analysis: After cleaning up your dataset, do some basic exploration by visualizing different types of summaries, such as boxplots, histograms, and scatter plots. This will give you an initial sense of what trends might exist within certain demographic or geographic regions and variables, which can then help inform future predictive models when needed. This step will also highlight any clear discontinuous changes over time, helping to ensure that predictors contribute meaningful signal rather than noise to overall predictions.

    • Analyze key metrics & observations: Once exploratory analyses have been carried out on the raw samples, post-processing steps come next, such as analyzing correlations among explanatory variables, performing significance testing on regression models, and imputing missing or outlier values, depending upon the specific project needs at hand. Additionally, interpretation efforts based
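    A common concrete screen for the outlier-cleaning step above is the interquartile-range (IQR) rule; a minimal sketch with made-up consumption values flags points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]:

```python
import statistics

# IQR screen: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# The consumption values below are made up for illustration.
def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    lo, hi = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < lo or v > hi]

usage = [21.0, 19.5, 22.3, 20.1, 18.9, 250.0]  # one suspicious reading
print(iqr_outliers(usage))  # [250.0]
```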

    Research Ideas

    • Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
    • Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
    • Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
  20. Data_Sheet_4_A Flexible, Extensible, Machine-Readable, Human-Intelligible,...

    • figshare.com
    docx
    Updated May 31, 2023
    Cite
    Gideon Kruseman (2023). Data_Sheet_4_A Flexible, Extensible, Machine-Readable, Human-Intelligible, and Ontology-Agnostic Metadata Schema (OIMS).DOCX [Dataset]. http://doi.org/10.3389/fsufs.2022.767863.s004
    Explore at:
    docx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Gideon Kruseman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper presents a lightweight, flexible, extensible, machine-readable, and human-intelligible metadata schema that does not depend on a specific ontology. The schema describes metadata for data files and is based on the concept of data lakes, where data are stored as they are. Its purpose is to enhance data interoperability. Messy socio-economic datasets that mix structured, semi-structured, and unstructured data often lack interoperability and are therefore underutilized. Adding a minimum set of rich metadata and describing new and existing data dictionaries in a standardized way goes a long way toward making these high-variety datasets interoperable and reusable, allowing timely and actionable information to be gleaned from them. The presented metadata schema, OIMS, can help standardize the description of metadata. The paper introduces the overall concepts of metadata, discusses design principles for metadata schemas, and presents the structure and an applied example of OIMS.
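In the spirit of the description above, a file-level metadata record for a data lake might look like the following. The field names here are illustrative only and are not the actual OIMS schema:

```python
import json

# Hypothetical minimal metadata record attached to a raw file in a data lake;
# field names are invented for illustration, not taken from OIMS.
record = {
    "fileName": "household_survey_2021.csv",
    "description": "Raw socio-economic household survey responses",
    "dataDictionary": [
        {"column": "hh_id", "type": "string", "description": "Household identifier"},
        {"column": "income", "type": "number", "unit": "USD/year"},
    ],
    "license": "CC-BY-4.0",
}

# Serializing to JSON keeps the record machine-readable yet human-intelligible
serialized = json.dumps(record, indent=2)
parsed = json.loads(serialized)
```

The point is that a small, standardized record like this travels alongside the raw file, so consumers can discover and interpret it without a shared ontology.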

Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
151 scholarly articles cite this dataset (View in Google Scholar)
