Data science take home challenge dataset



    It is now time for the most important step in the interview process: the take-home coding challenge. This is generally a data science problem. Data science coding projects vary in scope and complexity: sometimes the project could be as simple as producing summary statistics, charts, and visualizations.

    It could also involve building a regression model, a classification model, or a forecast using a time-dependent dataset. The project could also be very complex and difficult, in which case no clear guidance is provided as to the specific type of model to use.

    Generally, the interview team will provide you with project directions and a dataset. If you are fortunate, they may provide a small dataset that is clean and stored in comma-separated values (CSV) format. For the couple of interviews I had, I worked with two types of datasets: one was small, while the other had tens of thousands of observations with lots of missing values.

    The take-home coding exercise clearly differs from company to company, as further described below. In this article, I will share some useful tips from my personal experience that will help you excel in the coding challenge project.

    Sample 1 Coding Exercise: Model for recommending cruise ship crew size

    Instructions: This coding exercise should be performed in Python, which is the programming language used by the team. You are free to use the internet and any other libraries.

    Please save your work in a Jupyter notebook and email it to us for review. Please do the following steps (hint: use numpy, scipy, pandas, sklearn, and matplotlib):

    • Read the file and display the columns.
    • Calculate basic statistics of the data (count, mean, std, etc.), examine the data, and state your observations.
    • If you removed columns, explain why you removed them.
    • Use one-hot encoding for categorical features.
    • Calculate the Pearson correlation coefficient for the training and testing datasets.
    • Describe the hyper-parameters in your model and how you would change them to improve the performance of the model.
    • What is regularization? What is the regularization parameter in your model?
    • Plot the regularization parameter value vs. the Pearson correlation for the test and training sets, and see whether your model has a bias problem or a variance problem.

    This is an example of a very straightforward problem.
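    To make the workflow concrete, here is a minimal sketch of how a solution notebook might begin. The file name cruise_data.csv is a placeholder (the actual file name is not given here), and Ridge regression is just one reasonable choice of regularized linear model:

```python
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Read the file and display the columns
df = pd.read_csv("cruise_data.csv")  # placeholder file name
print(df.columns.tolist())

# Basic statistics: count, mean, std, etc.
print(df.describe())

# One-hot encode categorical features
df = pd.get_dummies(df, drop_first=True)

# 'crew' is the target, since the task is to recommend crew size
X = df.drop(columns=["crew"])
y = df["crew"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42  # split ratio is an arbitrary choice
)

# Fit a regularized linear model and report Pearson correlations
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("Pearson r (train):", pearsonr(y_train, model.predict(X_train))[0])
print("Pearson r (test):", pearsonr(y_test, model.predict(X_test))[0])

# Sweep the regularization parameter to diagnose bias vs. variance
for alpha in [0.01, 0.1, 1, 10, 100]:
    m = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, pearsonr(y_test, m.predict(X_test))[0])
```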

    The dataset is clean and small (just 9 columns), and the instructions are very clear. So, all that is needed is to follow the instructions and generate your code. Notice also that the instructions clearly specify that Python must be used as the programming language for model building. The time allowed for completing this coding assignment was three days.

    Only the final Jupyter notebook has to be submitted, and no formal project report is required.

    Tips for Acing Sample 1 Coding Exercise

    Since the project involves building a machine learning model, the first step is to ensure we understand the machine learning process.

    Figure 1. Illustrating the Machine Learning Process. Image by Benjamin O.

    Problem Framing: Define your project goals.

    What do you want to find out? Do you have the data to analyze?

    Data Analysis: Import and clean the dataset, and analyze the features to select those that correlate with the target variable. In this example, the dataset is clean and pristine, with no missing values, so no cleaning is required.

    Remarks on Data Quality: One of the major flaws of the dataset is that it does not provide units for the features; the units for cabin length, passenger density, and crew are all missing.

    These kinds of issues can be addressed by contacting the interview team to ask more about the dataset. It is important to understand the intricacies of your data before using it for building real-world models. Keep in mind that a bad dataset leads to bad predictive models. We observe from Figure 2 that there are strong correlations between features.

    Figure 2. Covariance matrix plot.

    This is important because multicollinearity between features can lead to a model that is complex and difficult to interpret. PCA can also be used for variable selection and dimensionality reduction; in that case, only the components that contribute significantly to the total explained variance are retained and used for model building.
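    A sketch of both ideas, continuing from the previous snippet (X is the feature matrix defined there; the 95% variance threshold is an arbitrary illustrative choice):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features so no single scale dominates the correlations or PCA
X_scaled = StandardScaler().fit_transform(X)

# Visualize pairwise correlations between features (cf. Figure 2)
corr = np.corrcoef(X_scaled, rowvar=False)
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.title("Feature correlation matrix")
plt.show()

# Keep only the components needed to reach 95% cumulative explained variance
explained = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.searchsorted(explained, 0.95) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
print(f"Retained {n_components} principal components")
```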

    Model Building: Pick the machine learning tool that matches your data and desired outcome, and train the model with the available data. The dataset has to be divided into training, validation, and test sets, and hyperparameter tuning has to be used to fine-tune the model in order to prevent overfitting.
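    A common way to do this tuning is a cross-validated grid search, for example over the Ridge regularization strength from the earlier sketch (the alpha grid is illustrative):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search over the regularization strength with 5-fold cross-validation
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```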

    Cross-validation is essential to ensure the model performs well on the validation set. After fine-tuning the model parameters, the model has to be applied to the test dataset.

    Figure 3. Mean cross-validation scores for different regression models.

    Application: Score your final model to generate predictions. Make your model available for production.

    Retrain your model as needed. In this stage, the final machine learning model is selected and put into production. The model is evaluated in a production setting in order to assess its performance. Any discrepancies between the model's experimental performance and its actual performance in production have to be analyzed.

    This can then be used in fine-tuning the original model. Based on the mean cross-validation score from Figure 3, we observe that Linear Regression and Support Vector Regression perform almost at the same level and better than KNeighbors Regression.
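    A comparison along the lines of Figure 3 can be generated with cross_val_score, which defaults to R² for regressors (again reusing X_train and y_train from the first sketch):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Mean 5-fold cross-validation score for each candidate regressor
models = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "KNeighborsRegressor": KNeighborsRegressor(),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV score = {scores.mean():.3f}")
```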

    For a complete solution of the Sample 1 coding exercise, please see the following link: Machine Learning Process Tutorial.

    Remarks on Sample 1 Coding Exercise

    Sometimes the coding exercise will ask you to submit only a Jupyter notebook, and sometimes it will ask for a full project report.

    Make sure your Jupyter notebook is well organized to reflect every stage of the machine learning process.

    Sample 2 Coding Exercise: Model for forecasting loan status

    Instructions: In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows: First, the borrower receives the funds. This event is called origination. The borrower then makes regular repayments until one of the following happens: (i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off. (ii) The borrower makes all scheduled payments through the end of the 3-year term. At this point, the debt has been fully repaid. In the attached CSV, each row corresponds to a loan, and the columns are defined as follows: The column with header "days since origination" indicates the number of days that elapsed between origination and the date when the data was collected.

    For loans that charged off before the data was collected, the column with header "days from origination to charge-off" indicates the number of days that elapsed between origination and charge-off.

    For all other loans, this column is blank. Objective: We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine.

    Also, we expect that this project will not take more than 3–6 hours of your time. The dataset here is complex (tens of thousands of rows, only 2 columns, and lots of missing values), and the problem is not very straightforward. You have to examine the dataset critically and then decide what model to use. This problem was to be solved in a week.

    It also specifies that a formal project report and an R script or Jupyter notebook file be submitted.

    Tips for Acing Sample 2 Coding Exercise

    As in the Sample 1 coding exercise, you need to follow the machine learning steps when tackling this problem.

    This particular problem does not have a unique solution. I attempted a solution using probabilistic modeling based on Monte-Carlo simulation. For a complete solution of the Sample 2 coding exercise, please see the following link.
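    To illustrate the Monte-Carlo idea, here is a deliberately crude sketch, not the actual submitted solution: loans.csv is a placeholder file name, and resampling observed charge-off times ignores censoring bias, which a rigorous answer would need to address.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
TERM_DAYS = 3 * 365  # the 3-year loan term, in days

df = pd.read_csv("loans.csv")  # placeholder file name
age = df["days since origination"].to_numpy(float)
co = df["days from origination to charge-off"].to_numpy(float)  # NaN if still current

observed_co = co[~np.isnan(co)]  # charge-off times seen so far
current = np.isnan(co)           # loans that have not (yet) charged off

estimates = []
for _ in range(1000):
    # Resample observed charge-off times as a crude stand-in for the true
    # charge-off-time distribution
    sample = rng.choice(observed_co, size=current.sum(), replace=True)
    # A current loan of age a charges off in the future only if its sampled
    # charge-off time is after a (it has not happened yet) and within the term
    future_co = (sample > age[current]) & (sample <= TERM_DAYS)
    total_co = (~current).sum() + future_co.sum()
    estimates.append(total_co / len(df))

print(f"Estimated charged-off fraction at end of term: "
      f"{np.mean(estimates):.3f} (spread {np.std(estimates):.3f})")
```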

    Data Science Project Scoping Guide

    For a brief walkthrough, please see the Blank Project Scoping Worksheet. For more details on any section in the worksheet, please use this guide.

    There are a lot of organizations out there — government agencies, nonprofits, social enterprises, corporations — working on important problems that can have a huge impact on society. There are also lots of talented, passionate, and smart people with data science skills who can help them tackle those problems.

    Yet, when these two sets of people come together, the results are often mixed because of the challenges associated with formulating a well-scoped project. We have found that it is necessary to have people who can mediate between the two groups and formulate a problem that is both solvable and impactful.

    Although this is written specifically to benefit people scoping data science projects for social good, the lessons here generalize to socially neutral and, unfortunately, socially evil projects as well. A well-scoped project is:

    • Solvable: Data can play a role in solving the problem, and the organization has access to the right data (see our data maturity framework to help you assess whether you have the right data).
    • Actionable: The organization has prioritized this problem, is ready to take actions based on the work, and is willing to commit resources to validate and implement it.

    There are many approaches to scoping a problem. We focus on projects that are actionable and lead to tangible positive societal outcomes. Let us know if you have alternative approaches that we can benefit from. As always, scoping is fairly iterative, and the scope gets refined both during the scoping process and during the project itself.

    Who needs to be involved?

    • Step 0: Problem Understanding — What is the problem? Who does it impact, and how much? How is it being solved today, and what are some of the gaps?
    • Step 1: Goals — What are the goals of the project? How will we know if our project is successful?
    • Step 2: Actions — What actions or interventions will this work inform?
    • Step 3: Data — What data do you have access to internally? What data do you need?
    • Step 4: Analysis — What analysis needs to be done? Does it involve description, detection, prediction, or behavior change? How will the analysis be validated? How will you evaluate the new system in the field to make sure it accomplishes your goals? How will you monitor your system to make sure it continues to perform well over time?

    The scoping process is iterative and not strictly linear. While understanding the problem and defining the goals of the project must come before identifying actions, the process of identifying actions may help discern whether a goal is actionable and must be redefined.

    Similarly, assessing the data available for the project may lead us to rethink which problems, goals, and actions can be informed by a data science project. Our analysis may lead us to rethink our problem and our goals and start the scoping process anew. So, while each step follows the previous, we should take each step as an opportunity to evaluate earlier steps.

    Throughout, ethics should be at the center of our scoping process. Developing a clear and explicit understanding of your goals in this regard is fundamental to doing it well.

    Crucially, the integration of ethical considerations into the project should be neither an afterthought nor a burden, but rather a critical and continuous area of focus that involves all stakeholders, especially the people who will be impacted by the system.

    Step 0: Understand the problem

    Before we start scoping the project, we need to make sure we understand the problem and its impact. The problem should be a priority for the organization, and one that can be addressed using data the organization has or can access.

    If the problem is not a priority, then even a well-designed model will not help resolve it because the organization will lack the motivation to act on the information resulting from the analysis.

    To begin, it is important to understand the scope of the problem. We first ask organizations to describe the problem they are facing, including who or what is affected by the problem, how many are affected, and how much they are affected. For example, a school district may be concerned with low high school graduation rates. They should be able to tell us who is affected, perhaps low-income students or students who are otherwise at risk.

    We then ask the organization to explain why the problem is a priority now and how they have been tackling the problem. For example, a school district may be concerned about high school graduation rates because they recently found that they were particularly low among at-risk students. The school district may be enrolling at-risk students in after-school tutoring programs that help reinforce in-school learning.

    If space in after-school programs is limited, then the right analysis could help the school district prioritize students for enrollment who are unlikely to graduate on time. Finally, it is important to identify the groups or stakeholders inside and outside your organization who need to be involved in scoping and implementing the project. Typically, data science projects need involvement from stakeholders inside your organization, such as policymakers, managers, data owners, IT infrastructure owners, and the people who will intervene, such as health workers.

    Projects also often require the involvement of people and groups from outside your organization, such as community groups that will be affected by this work. For example, a school district scoping a data science project identifying students who are at-risk of not graduating from high school will want to engage senior policymakers as well as IT and data workers early in their planning. They will also want to engage school administrators, teachers, and tutoring program workers to help understand the problem and scope the project.

    The district will also want to engage the school board and the parents of students who are likely to be impacted by the project, especially in communities most affected by low graduation rates.

    Step 1: Goals

    This is the most critical step in the scoping process. In the context of our projects, a goal is a concrete, specific, measurable aim or outcome that the organization will accomplish by addressing the problem. We often come across efforts where an organization defines the goal of its project as building a technical solution, such as a predictive model, dashboard, or map. We argue that the technical solution (model, dashboard, map) is not itself the goal of a data science project.

    Most projects start with a very vague and abstract goal (say, improving education), get a little more concrete (increase the percentage of students who will graduate on time), and keep getting refined until the goal is concrete, unambiguous, and achieves the societal aims of the organization.

    Sometimes, these goals exist but are locked implicitly in the minds of people within the organization. Other times, there are several goals that different parts of the organization are trying to achieve, and we will often have possibly conflicting goals around efficiency, effectiveness, and equity. We should not only define these explicitly during the scoping process but also attempt to prioritize them at this stage.

    Usually, goals also include constraints. Constraints are often what make a data science project necessary. For example, if a public health agency could inspect every rental property in the city for housing code violations, they probably would. However, there may be a constraint on the number of properties an agency can inspect within a certain period, so they may want to prioritize the ones most likely to be unsafe to live in.

    A preventative public health program might want its goal to be minimizing the number of unplanned Emergency Room (ER) visits. A trivial (and evil) approach to achieving that would be to shut down all ERs, resulting in zero visits. Adding a constraint requiring the solution to improve health outcomes, and not just reduce ER visits, will help us identify solutions that have the desired social impact.

    Example 1 — Lead Poisoning: We worked with the Chicago Department of Public Health on reducing lead poisoning rates among children in Chicago. The initial goal was to reduce lead poisoning by increasing the effectiveness of the department's limited lead hazard inspections.

    One way to achieve that goal would be to focus inspections on homes that are likely to have lead hazards. Finding a home with lead hazards and getting it remediated is only beneficial if there is a high chance that a child is present in the home currently or in the future who is likely to get exposed to lead and develop lead poisoning. The next iteration of the goal was to increase the number of inspections that find lead hazards in homes where there is an at-risk child before the child gets exposed to lead.

    Eventually, we got to the final goal: reducing the number of children who will get lead poisoning in the future because of lead hazards in their current residence, by (1) identifying which children are at high risk of lead poisoning in the future, and then (2) targeting interventions at the homes of those children to remove those lead hazards.

    Example 2 — On-time High School Graduation: One of the challenges schools are facing today is helping their students graduate on time. They are interested in identifying students who are at risk of not graduating on time and need extra support.

    When we initially talk to school districts, most start with the very narrow goal of predicting which kids are unlikely to graduate on time. The first step in our scoping process is to go back to the goal of increasing graduation rates and ask whether there is a specific subset of at-risk students they want to identify.

    If the goal is just to increase graduation rates, the first group is probably easier to intervene with and influence, while the second group may be more challenging because of the resources those students need.

    Or is the goal to create more equity and reduce the difference in on-time graduation rates between those who are most likely to graduate and those who are least likely to? All of these are reasonable goals, but schools have to understand, evaluate, and decide which goals are most important to them. This conversation often makes them think more deeply about what their organizational goals are, as well as the tradeoffs between them. A reasonable goal we may end up with after the scoping process is to minimize the disparities in graduation rates across different racial groups while maximizing overall graduation rates.

    Some examples include: the U.S. Environmental Protection Agency and the New York State Department of Environmental Conservation, prioritizing which facilities to inspect for waste disposal violations; the City of Cincinnati, identifying properties at risk of code violations; and the World Bank Group, prioritizing fraud and collusion complaints to investigate. In most inspection problems, there are many more entities (homes, buildings, facilities, businesses, contracts) to inspect than there are resources available to conduct those inspections.

    The goal most organizations start with is focusing their inspections on entities that are likely to be in violation of existing regulations. While this is a good start, most of these organizations can never inspect everything that may be non-compliant. The goal they really want to achieve is deterrence — reducing the total number of facilities that will be in violation.

    Another example comes from an organization that deploys portable toilets across informal urban settlements; one of its largest costs is hiring people to empty the toilets by collecting waste from each of them. Since toilet usage varies, it is inefficient to empty every toilet every day, as was done when the project started. A different formulation of the goal could be to minimize the number of times an individual goes to use a toilet but cannot because it is full and unusable, given the staffing constraints on emptying the toilets.

    Considering trade-offs while deciding on goals

    As we start defining and prioritizing goals, often around efficiency, effectiveness, and equity, the conversation leads to tradeoffs. Would you rather inspect more homes without finding lead hazards in them (which is inefficient), or would you rather miss homes with children who will end up getting lead poisoning?

    When dispatching and placing emergency response vehicles, do you want to make sure you can get to every possible emergency within 10 minutes or do you want to make sure that you can get to critical emergencies within 3 minutes and the non-critical within 20 minutes?

    What types of mistakes are you more willing to make? That is a critical question a good scoping process brings up and attempts to answer based on the priorities of the organization.

    In data science terms, would you rather have more false positives or more false negatives? Do you want these false positives or false negatives to be balanced across different racial, gender, age, or socioeconomic groups? Of course, this decision depends on the impact and cost of those errors, which is often hard and sometimes uncomfortable to quantify.
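    To make this concrete, here is a small sketch of how one might compare error rates across groups; the data frame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical evaluation data: one row per person, with the model's
# prediction, the true outcome, and a demographic group label
results = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 0, 1, 0],
    "y_pred": [1, 1, 0, 0, 0, 1],
})

for group, g in results.groupby("group"):
    fp = ((g.y_pred == 1) & (g.y_true == 0)).sum()
    fn = ((g.y_pred == 0) & (g.y_true == 1)).sum()
    fpr = fp / max((g.y_true == 0).sum(), 1)  # false positive rate
    fnr = fn / max((g.y_true == 1).sum(), 1)  # false negative rate
    print(f"group {group}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```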

    Be friendly and sound interested and excited. Always have a few questions to ask the recruiter as well! Lastly, have an open email dialogue with the recruiter.

    If they set up a time to have you talk to the hiring manager, or come in for an onsite interview, ask them if they have any advice or anything that might be helpful to prepare for.

    As the application process moves into its later stages, recruiters want to get positions filled, and there is a good chance they will offer at least some advice.

    The more information you can have at each step of the process, the better.

    How to impress the hiring team with the take-home data assignment

    Oh, the famed take-home data task.


    These are based on real assignments from some of the biggest tech companies. In this technical screening step, there are usually two types of tasks:

    • Timed programming challenge. I do not have much advice on how to prepare for these.
    • Data analysis challenge. This is the most frequently used technical screen. It usually involves a dataset and some questions to show off your programming skills as well as your ability to analyze and synthesize results.


    This section is focused on this type of technical challenge. The data analysis challenge is used to evaluate the following:

    • Can you demonstrate the technical skills you discussed in your resume?
    • Are you able to handle and clean messy data?
    • Is your code clean, well-written, and well-documented?

    Data Scientist Interview Questions

    Typical questions include:

    • Which programming languages do you have the most experience with?
    • If you had to choose one algorithm to analyze data, what would it be and why?
    • How do you differentiate between machine learning and deep learning? Can you provide examples?


    • You create a data storage system to organize data figures, but it isn't working correctly. Are you comfortable asking others for help?
    • What is the confusion matrix used for? Can you provide an example?
    • How do you decide which models or algorithms to use in analyzing data sets?

    Knowing a candidate's languages and tools is vital, as it can cost more time and money to train someone who is not knowledgeable in all of the languages and applications required for the position.
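    For the confusion matrix question, a minimal example answer in sklearn might look like this (toy labels only):

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels: the matrix counts outcomes as
# [[true negatives, false positives],
#  [false negatives, true positives]]
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]
print(confusion_matrix(y_true, y_pred))
```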

    This question reveals how the candidate approaches solving real-world issues they will face in their role as a data scientist, and how they approach problem-solving from an analytical standpoint. This information is vital because data scientists must have strong analytical and problem-solving skills.

    A sample answer: "I then evaluate the performance based on criteria set by the lead data scientist or company, and discuss my findings with my team lead and group."


    Their answer should reveal their inspiration for working for the company and their drive for being a data scientist. A sample answer: "Your firm uses advanced technology to address everyday problems for consumers and businesses alike, which I admire. I also enjoy solving issues using an analytical approach and am passionate about incorporating technology into my work."

