Thesis internship - Data Science
Pointlogic, A Nielsen Company, helps customers with decision making in the area of media and marketing. Our main assets are Nielsen data and advanced analytics capabilities. Pointlogic's head office in Rotterdam has a data science team of about 20 people with backgrounds in econometrics, operations research, and mathematics. We offer data science, econometrics, operations research, and mathematics students the possibility to work solely on their thesis project (part of their MSc curriculum), solely on internal projects, or on a combination of both.
To give you an idea of the topics we offer, you can read more about the topics from the previous academic year below:
1. ACCELERATED PROBABILISTIC PROGRAMMING
Pointlogic uses probabilistic programming extensively for various types of analytical inference; currently that inference is performed in Stan. Probabilistic programming is a tool for statistical modeling, closely linked to Bayesian statistics: it seeks to describe probabilistic models and then perform inference in those models.
One of our current technical challenges is that our models are computationally expensive and we do not have satisfactory options for reducing the run time. The goal of this project is to evaluate the various algorithms and technical frameworks available for performing probabilistic programming.
We have a few potential avenues that we would like to explore but are open to other ideas as well:
- We currently use Hamiltonian Monte Carlo; another option is variational inference, which gives approximate answers but in far less time.
- Evaluating new frameworks for Pointlogic-specific models. Recent frameworks include Edward (which uses Google's TensorFlow as a backend) and Pyro from Uber AI Labs.
- Use of GPUs for large matrix calculations, or exploring distributed computing through systems such as MPI.
- General reparameterizations to the model structure that allow for more efficient sampling.
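To make the trade-offs concrete, here is a minimal sketch (not one of Pointlogic's models) of the MCMC machinery that samplers like Stan's are built on: a random-walk Metropolis sampler targeting a standard normal distribution. Hamiltonian Monte Carlo replaces the random-walk proposal with gradient-informed trajectories, which is exactly where the computational cost this project targets comes from.

```python
import math
import random

def metropolis(log_density, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler: the simplest relative of the
    Hamiltonian Monte Carlo machinery used by Stan."""
    rng = random.Random(seed)
    x, samples = x0, []
    log_p = log_density(x)
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x))
        if math.log(rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples

# Standard normal target; draws should centre on 0 with unit variance.
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(draws) / len(draws)
```

Even this toy sampler makes the runtime question visible: every iteration needs a fresh evaluation of the log density, so an expensive model multiplies directly into wall-clock time.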
2. MODELLING ADVERTISING EFFECTIVENESS UNDER UNCERTAINTY
The goal of this project is to review a current component of modelling with survey data at Pointlogic. Our respondent-level brand health models use survey data almost exclusively as fuel for the modelling. The problem is that, when modelling with survey data, we have no true measured information on when an individual has been exposed to advertising (except for cases where digital ads are tagged). We have a methodology in place to help us here, Contact Estimation, which combines a number of pieces of information, such as a respondent's typical media consumption behavior and where and when an advertiser advertised. This information is combined to provide an estimate of the expected number of contacts an individual will have had, broken down by media channel.
We have many potential areas for exploration here:
- Currently the Contact Estimation and the modelling are two separate procedures; would there be added value in a single procedure that performs both operations concurrently?
- Is using the expected number of contacts sufficient, or is there enough added value in incorporating more of the uncertainty around the number of contacts to justify the added complexity and computational overhead?
- Is there a bias in estimated effects of media by channel after Contact Estimation?
- If so, how large is it and can we construct a methodology to correct for any bias?
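As a hypothetical sketch of the expectation-versus-uncertainty question above: if each ad spot carries an (assumed independent) exposure probability for a respondent, the total number of contacts follows a Poisson-binomial distribution, whose mean and variance come directly from the per-spot probabilities. The probabilities and channel labels below are invented for illustration.

```python
def contact_moments(exposure_probs):
    """Given per-spot exposure probabilities for one respondent, return the
    expectation and variance of the total number of contacts.
    Assuming independent spots, the total is Poisson-binomial distributed."""
    mean = sum(exposure_probs)
    var = sum(p * (1.0 - p) for p in exposure_probs)
    return mean, var

# Hypothetical respondent: three TV spots and two radio spots.
probs = {"tv": [0.8, 0.5, 0.3], "radio": [0.6, 0.2]}
by_channel = {ch: contact_moments(ps) for ch, ps in probs.items()}
```

Keeping only the expectation discards the variance term; whether that lost uncertainty materially changes the estimated media effects is precisely the project question.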
3. MACHINE LEARNING FOR DIMENSION REDUCTION
In many projects, Pointlogic uses dimension reduction to reduce the number of features used in a model. The most common technique is principal component analysis (PCA). In this project, we will use machine-learning methods (e.g. auto-encoders) for dimension reduction and compare their performance with the commonly used methods.
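For reference, the baseline technique can be sketched in a few lines: the first principal component is the dominant eigenvector of the covariance matrix, recoverable by power iteration. This is a minimal pure-Python illustration on invented 2-D data, not Pointlogic's production pipeline.

```python
import random

def first_pc(data, n_iter=100, seed=0):
    """First principal component of `data` (a list of equal-length rows),
    found by power iteration on the sample covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the centred data
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]   # renormalise each iteration
    return v

# Points scattered along the line y = x: the first PC is ~(0.707, 0.707).
pts = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
pc = first_pc(pts)
```

An auto-encoder generalises this linear projection to a learned non-linear one, which is where the project hopes to find added value.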
An important type of project within Pointlogic is 'data fusion'. As explained below, dimension reduction is an important step in it, and we would like to examine the added value machine learning can bring to our methodology.
Data fusion is a method of integrating data sets using statistical analytics and modeling in order to create a single data set that incorporates the attributes from both underlying sources. The 'integrated' data set allows us to perform analyses that require variables from both sources. For media measurement (panel) integration, the following example of a 1-1 fusion is illustrative:
- Data Set 1: TV panel. This is a data set of television viewing for a group of people. This data set includes demographics (e.g. age, gender, ethnicity, people in the home, income), along with detailed television viewing behavior, and information about computer device ownership and usage behavior. We will let the TV panel be the recipient data set.
- Data Set 2: Radio panel. This is a data set of listening behavior for a separate group of panelists (not in Data set 1). This data set also includes demographics, some measurement of TV and most importantly detailed listening behavior. Let the radio panel be the donor data set.
- Data Set 3: The fused data set. The integrated data set contains the original panelists from the TV panel, but these panelists have been assigned listening behaviors that came from the radio panel.
To create Data Set 3, we identify like-variables within both Data Set 1 and Data Set 2, which are called linking variables. Using statistical analysis to understand how each of these variables correlates with both television viewing and listening behavior, we establish importance weights for each variable, including which variables must match exactly between both data sets (always age/gender and various demographic traits).
When the linking variables and their importance are established, we can define a distance function between respondents. The next step, linking donors to recipients while minimizing total distance, is known in Operations Research as a generalized assignment problem and is well studied. Standard algorithms can solve the assignment problem very efficiently. The combined data set can be used to estimate the overlap between TV stations and radio stations, or the joint reach of a cross-media campaign.
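On a toy scale the assignment step can even be solved exactly by brute force, which makes the idea concrete; the distance matrix below is invented, and real fusions would use the Hungarian algorithm or an LP solver rather than enumeration.

```python
from itertools import permutations

def best_assignment(dist):
    """Exact 1-1 assignment for a tiny square distance matrix
    dist[recipient][donor], minimising the total distance by
    enumerating all permutations (only feasible for toy sizes)."""
    n = len(dist)
    best = min(permutations(range(n)),
               key=lambda perm: sum(dist[i][perm[i]] for i in range(n)))
    return list(best)

# 3 TV-panel recipients x 3 radio-panel donors, hypothetical distances
# computed from weighted linking variables.
dist = [[4.0, 1.0, 3.0],
        [2.0, 0.5, 5.0],
        [3.0, 2.0, 2.0]]
match = best_assignment(dist)   # donor index assigned to each recipient
```

Each recipient then inherits the listening behavior of its matched donor, producing the fused Data Set 3.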
4. OUTLIER DETECTION IN TV PANELS
In many countries, Nielsen is responsible for providing TV ratings data: information on how many people watched specific shows on television. These TV ratings are considered a 'currency', as advertisers buying spots for TV advertising pay based on the number of people watching during that time. Therefore having robust, reliable ratings is crucial for Nielsen and the TV buying market. TV ratings are measured via a panel study; the size of this panel varies by country.
The aim of this project is to identify individuals who demonstrate unnatural viewing behavior, using a set of sophisticated indicators still to be defined. Specifically, the task is to define a scientific methodology and tools to detect suspicious individuals, and to use these objective indicators to decide which individuals show proven false behavior and should be excluded from the panel.
Considerations to take into account: instructions given to panelists by third parties to watch specific content may vary over time, may differ by channel, and may be quite sophisticated in order to avoid easy detection.
The aim is to explore the opportunities offered by machine learning and various statistical methods. The available information consists of viewing data per individual per channel, at the level of individual viewing statements, along with demographic characteristics on a daily basis. The analysis should be run over the whole lifetime of the panel at regular intervals, so that individuals showing false behavior are detected in a timely manner as soon as the behavior appears. The moment at which a household is contacted by third parties with instructions to watch specific content is unknown: it could be at the start of the household's inclusion in the panel, or on any day during its presence in the panel.
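As a deliberately naive baseline (the panel data and viewing minutes below are invented), one could flag panelists whose average daily viewing time is an extreme outlier relative to the rest of the panel. A real methodology would need richer, per-channel behavioural indicators, as the project description stresses.

```python
import statistics

def flag_suspicious(daily_minutes, z_threshold=2.5):
    """Flag panelists whose mean daily viewing time is a z-score outlier
    relative to the panel - a simple baseline, not a full methodology."""
    means = {pid: statistics.fmean(mins) for pid, mins in daily_minutes.items()}
    mu = statistics.fmean(means.values())
    sigma = statistics.stdev(means.values())
    return {pid for pid, m in means.items() if abs(m - mu) / sigma > z_threshold}

# Hypothetical panel: eight plausible viewers plus one implausibly heavy one.
panel = {f"p{i}": [m - 10, m, m + 10]
         for i, m in enumerate([100, 120, 140, 160, 180, 200, 110, 150], 1)}
panel["p9"] = [1150, 1200, 1250]   # roughly 20 hours of viewing per day
suspects = flag_suspicious(panel)
```

The interesting research questions start where this baseline fails: instructed viewing that mimics normal volumes but concentrates on particular channels or time slots.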
5. SWARM OPTIMIZATION IN MARKETING
The goal of this project is to apply particle swarm optimization to a large optimization problem with tens of thousands of real-valued decision variables, linear restrictions, and a non-linear objective function that is very computationally demanding to evaluate. The application is planning an advertising campaign: the decision variables correspond to budgets allocated to a large set of media, and the objective function represents the ROI of the campaign.
The main reason why we are interested in particle swarm optimization is that it can take advantage of a parallel computing environment. Working on this project, you will be implementing and tuning a number of variations of the algorithm, and evaluating these on a benchmark set of optimization problems.
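A minimal version of the algorithm fits in a few dozen lines. The sketch below optimises the sphere function as a stand-in for the (expensive) ROI objective; the particle count, iteration budget, and inertia/cognitive/social weights are common textbook defaults, not tuned values.

```python
import random

def pso(objective, dim, n_particles=30, n_iter=200, bounds=(-5.0, 5.0), seed=0):
    """Minimal particle swarm optimiser: each particle remembers its own
    best position, the swarm shares a global best, and velocities mix
    inertia with pulls toward both bests."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    w, c1, c2 = 0.7, 1.5, 1.5          # inertia, cognitive, social weights
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Sphere function (minimum 0 at the origin) as a cheap test objective.
best, best_val = pso(lambda x: sum(v * v for v in x), dim=5)
```

Note that the objective evaluations inside the inner loop are independent across particles, which is what makes the algorithm attractive for parallel computing.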
6. WISDOM OF THE CROWD
Surveys about media consumption behavior are at the core of many of our products. The biggest and most time-consuming part for respondents is the set of questions related to the reach and frequency of specific media channels. Due to the large number of channels, we are forced to ask respondents about only a subset of them. Experiments show that we are able to collect much more, and more accurate, information when asking people to assess group behavior rather than individual behavior.
The goal of this project is to combine two interviewing techniques, which will allow respondents to describe their media behavior in a non-traditional way.
- The first experimental treatment shows the respondent previously collected data about media consumption and asks them to correct it so that it matches their own consumption.
- The second treatment includes questions that ask respondents about the average media consumption of their peer group rather than themselves.
By doing so we can assess respondents' meta-knowledge about the population distribution rather than their own behavior, and correct for individual errors. This project includes combining these two treatments in a two-by-two experimental design and comparing the results with data collected via our traditional questionnaire design.
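One simple way the two sources of information could be combined (a hypothetical sketch, not the project's prescribed estimator) is shrinkage: pull each self-report toward the crowd consensus implied by the peer-group answers. All numbers and the 50/50 weight below are invented for illustration.

```python
import statistics

def crowd_corrected(self_reports, peer_estimates, weight=0.5):
    """Blend each respondent's self-reported value with the crowd's view:
    the mean of all peer-group estimates acts as a shrinkage target that
    pulls individual answers toward the consensus."""
    crowd_mean = statistics.fmean(peer_estimates)
    return [weight * s + (1 - weight) * crowd_mean for s in self_reports]

# Hypothetical weekly listening hours: self-reports vs. peer-group guesses.
self_reports = [2.0, 10.0, 6.0]
peer_estimates = [5.0, 6.0, 7.0]   # each respondent's guess for their peers
adjusted = crowd_corrected(self_reports, peer_estimates)
```

Whether such a correction actually reduces error, and with what weight, is exactly what the experimental comparison against the traditional questionnaire would have to establish.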
*Note: this project requires primary data collection and funding; these can only be considered after a complete research proposal has been submitted.
Requirements:
- Currently enrolled in an MSc in Computer Science, Econometrics, Mathematics, or another quantitative programme;
- Some experience with R and/or Python programming, classification models, and simulation theory;
- Good command of spoken and written English;
- Available for at least 20 hours per week;
- Available for a period of at least 6 months.
If you are interested in a thesis internship, feel free to apply. We would like to find out if you are up for this challenge, so please include your CV, motivation letter, and list of grades.