CMT309 Coursework Assessment Pro-forma

Cardiff School of Computer Science and Informatics

Module Code: CMT309

Module Title: Computational Data Science

Lecturer: Dr. Matthias Treder, Dr. Luis Espinosa-Anke

Assessment Title: CMT309 Programming Exercises

Assessment Number: 3

Date set: 06-03-2020

Submission date and time: 08-05-2020 at 9:30 am

Return date:

This assignment is worth 40% of the total marks available for this module. If coursework is

submitted late (and where there are no extenuating circumstances):

1 - If the assessment is submitted no later than 24 hours after the deadline, the

mark for the assessment will be capped at the minimum pass mark;

2 - If the assessment is submitted more than 24 hours after the deadline, a mark

of 0 will be given for the assessment.

Your submission must include the official Coursework Submission Cover sheet, which can

be found here:

https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf

Submission Instructions

Your coursework should be submitted via Learning Central by the above deadline. You have

to upload the following files:

Description                    Type                                  Name
Cover sheet                    Compulsory, one PDF (.pdf) file       Student_number.pdf
Your solution to question 1    Compulsory, one Python (.py) file     Q1.py
Your solution to question 2    Compulsory, one Python (.py) file     Q2.py
Your solution to question 3    Compulsory, one Word (.docx) file     Q3.docx

For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g.

“C1234567890.pdf”. Make sure to include your student number as a comment in all of the

Python files! Any deviation from the submission instructions (including the number and types

of files submitted) may result in a reduction of marks for the assessment or question part.

You can submit multiple times on Learning Central. ONLY files contained in the last attempt

will be marked, so make sure that you upload all files in the last attempt.

Assignment

Start by downloading the following files from Learning Central:

• Q1.py

• acronym_example1.txt, acronym_example2.txt, acronym_example3.txt,

acronym_example4.txt, acronym_tuples.txt

• Q2.py

• Q3.py

• Q3.docx

Then answer the following questions. You can use any Python expression or package that was

used in the lectures. Additional packages are not allowed unless instructed in the question.

Question 1 - What is the long form of the acronym? (Total 35 marks)

In this question, your task is to implement several functions that parse text strings for

acronyms and their long forms. Acronyms are abbreviations typically formed from the initial

letters of multiple words and pronounced as a word. For instance, the acronym "GPU" stands

for the long form "graphics processing unit". In this question, an acronym is defined as a

character sequence of at least two successive capital letters. Your task is to implement several

functions that together parse a text for acronyms and find their long forms.

As an example text, let us define the string

s = "A GPU, which stands for graphics processing unit, is different

from CPUs, says the IT expert. For some operations, a GPU is faster

than a CPU. GPUs are not always faster though."

Q1 a) Parse acronyms (10 marks)

Write a function read_file(filename) that receives as input a filename. The filename

includes the filepath. The function returns the entire content of the file as a single string.

Write a function find_acronyms(s) that receives as input a string s representing the text.

The function returns a list of acronyms. For our example above, find_acronyms(s) returns

the list ['GPU', 'CPU', 'IT']. Note: It is not important in which order the acronyms

appear in the returned list.
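
A minimal sketch of both functions, assuming that a regular expression is an acceptable way to detect runs of capital letters and that the text files are UTF-8 encoded (the encoding is an assumption, it is not stated in the task). Duplicates are dropped in order of first appearance, although the order is not required.

```python
import re


def read_file(filename):
    """Return the entire content of the file as a single string."""
    # Assumption: the text files are UTF-8 encoded.
    with open(filename, encoding="utf-8") as f:
        return f.read()


def find_acronyms(s):
    """Return the unique runs of two or more successive capital letters in s."""
    matches = re.findall(r"[A-Z]{2,}", s)
    # dict.fromkeys removes duplicates and keeps first-appearance order.
    return list(dict.fromkeys(matches))


s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
print(find_acronyms(s))  # ['GPU', 'CPU', 'IT']
```

Note that 'CPUs' and 'GPUs' contribute 'CPU' and 'GPU' because the lowercase 's' ends the run of capital letters.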

Q1 b) Find the long forms (15 marks)

In this question the hard work is done: given the acronyms, your task is to find their long

form in the text. To this end, write a function find_long_forms(s, acronyms). It receives

as input a string s representing the text and a Python list of acronyms. The function returns

a dictionary d with key-value pairs, where the key is the acronym and the value is its long

form. For instance, in our example above the output is the dictionary d = {'GPU' :

'graphics processing unit', 'CPU' : None, 'IT' : None}.

You can make the following assumptions:

• The long form is found in the same sentence as the acronym itself.

• If the acronym occurs multiple times in a text, its long form is found in the first

sentence that contains the acronym.

• Every '.' (dot) marks the end of a sentence. Sentences like "I talked to the Dr. and

raised my concerns." where dots are contained within the sentence will not occur.

• The first letter of the acronym is the same letter as the first letter of the first word of

the long form. All of the letters in the acronym need to appear in the long form.

• If no long form can be found for an acronym, it is set to None (Python's None type)

as in the dictionary above.

Four examples for texts with acronyms are given in the example files

acronym_example1.txt, acronym_example2.txt etc. The corresponding tuples

of (acronym, long form) are specified in the file acronym_tuples.txt.
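
One possible starting point is sketched below. It assumes a stricter rule than the task requires: the long form consists of exactly one word per acronym letter, and the initials of those words spell the acronym. This simplification is an assumption made for illustration; long forms whose letters sit inside words would need a looser matching rule.

```python
import re


def find_long_forms(s, acronyms):
    """Map each acronym to a long form found in its first sentence, or None."""
    # Every '.' ends a sentence (see the assumptions above).
    sentences = [part.strip() for part in s.split(".") if part.strip()]
    d = {}
    for acronym in acronyms:
        d[acronym] = None
        # First sentence that contains the acronym.
        sentence = next((x for x in sentences if acronym in x), None)
        if sentence is None:
            continue
        words = re.findall(r"[A-Za-z]+", sentence)
        k = len(acronym)
        # Slide a window of k words and compare its initials with the acronym.
        for i in range(len(words) - k + 1):
            window = words[i:i + k]
            initials = "".join(w[0] for w in window).upper()
            if initials == acronym and acronym not in window:
                d[acronym] = " ".join(window)
                break
    return d


s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
print(find_long_forms(s, ['GPU', 'CPU', 'IT']))
# {'GPU': 'graphics processing unit', 'CPU': None, 'IT': None}
```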

Q1 c) Replace acronyms by long forms (10 marks)

Assume we want to make the document more self-explanatory and replace its acronyms with

their corresponding long forms. To this end, write a function replace_acronyms(s, d). It

receives as input a string s representing the text, and a dictionary d which contains

acronym : long_form key-value pairs as defined in Q1b). The function returns another string as

output. In this output, all acronyms in s have been replaced with their long forms. The

following rules apply:

- If an acronym has a long form, the sentence wherein the long form was defined

remains unchanged. For any other sentence, the acronym is replaced by the long form.

- If an acronym has no long form, it is not replaced anywhere.

- If you add the long form at the beginning of a sentence, make sure that its first word is

capitalised.

For instance, in our example above the output of the function is the string:

"A GPU, which stands for graphics processing unit, is different from

CPUs, says the IT expert. For some operations, a graphics processing

unit is faster than a CPU. Graphics processing units are not always

faster though."
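
The rules above can be sketched as follows. The sketch assumes, as in Q1b, that the first sentence containing an acronym is the one that defines it, and it uses plain substring replacement so that plurals such as 'GPUs' become 'graphics processing units'. Both choices are illustrative assumptions rather than the required approach.

```python
import re


def replace_acronyms(s, d):
    """Replace acronyms by their long forms outside the defining sentence."""
    # Splitting on '.' and re-joining with '.' reproduces the text exactly.
    sentences = s.split(".")
    seen = set()  # acronyms whose first (defining) sentence has been passed
    result = []
    for sentence in sentences:
        # Position of the first non-whitespace character of the sentence.
        start = len(sentence) - len(sentence.lstrip())
        for acronym, long_form in d.items():
            if long_form is None or acronym not in sentence:
                continue
            if acronym not in seen:
                seen.add(acronym)  # the defining sentence stays unchanged
                continue

            def substitute(match, long_form=long_form):
                # Capitalise the long form if it opens the sentence.
                if match.start() == start:
                    return long_form[0].upper() + long_form[1:]
                return long_form

            sentence = re.sub(re.escape(acronym), substitute, sentence)
        result.append(sentence)
    return ".".join(result)


s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
d = {'GPU': 'graphics processing unit', 'CPU': None, 'IT': None}
print(replace_acronyms(s, d))
```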

As a starting point, use Q1.py from Learning Central. Do not rename the file or the function.

Q2 Statistics (Total 35 marks)

In this question, your task is to implement several statistical functions that perform t-tests,

linear regression, and variable selection.

Q2 a) Mass t-tests (10 marks)

In this question, your task is to implement two functions that perform dependent and

independent t-tests on input data. You can use the corresponding t-test functions in

scipy.stats.

Write a function mass_paired_ttest(X) that performs a series of paired-samples t-tests. It

receives as input a numpy array X with dimensions $n \times m$, where $n$ is the number of rows and

$m$ is the number of columns. Each of the $m$ columns represents one sample. Your function

should find the pair of columns that yields the lowest p-value i.e. it is the 'most significant'.

Then the function returns a tuple with three elements (index of the first column from the pair,

index of the second column from the pair, corresponding p-value).

Example: imagine your dataset has shape (100, 3), i.e. it has three columns. Assume the p-values for the three pairs of columns are p = 0.4 (col 0 vs col 1), p = 0.12 (col 0 vs col 2), p =

0.08 (col 1 vs col 2). The lowest p-value is obtained for col 1 vs col 2 and its value is 0.08, so

the tuple that is returned by the function is t = (1, 2, 0.08).

Write a similar function mass_independent_ttest(*X) that performs a series of

independent t-tests. It takes multiple inputs: each input is a vector (1-D array) representing

a single sample, so inside the function X is a tuple of Numpy arrays. The arrays can have different lengths. You

can access each array using its index, e.g. X[0] is the first array, X[1] is the second array etc.

As for the paired-samples t-test, find the most significant pair of samples and return the

tuple of three elements (index of the first sample, index of the second sample, p-value).
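
A minimal sketch of both functions, using scipy.stats.ttest_rel and scipy.stats.ttest_ind. It assumes the default settings of both tests (for ttest_ind this means equal variances are assumed), since the task does not say otherwise. The data at the end are made up for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def mass_paired_ttest(X):
    """Return (i, j, p) for the pair of columns of X with the lowest p-value."""
    best = None
    # combinations yields (0, 1), (0, 2), ..., so i < j for every pair.
    for i, j in combinations(range(X.shape[1]), 2):
        p = stats.ttest_rel(X[:, i], X[:, j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best


def mass_independent_ttest(*X):
    """Return (i, j, p) for the pair of samples in X with the lowest p-value."""
    best = None
    for i, j in combinations(range(len(X)), 2):
        p = stats.ttest_ind(X[i], X[j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best


# Illustrative data: column 2 is shifted by 5 relative to columns 0 and 1.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
X = np.column_stack([base,
                     base + rng.normal(scale=0.1, size=50),
                     base + 5 + rng.normal(scale=0.1, size=50)])
print(mass_paired_ttest(X))
print(mass_independent_ttest(rng.normal(size=30), rng.normal(loc=3, size=40)))
```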

Q2 b) Ridge regression (10 marks)

In this question your task is to implement ridge regression from scratch using Numpy. Do not

use statsmodels or scipy for this question. Ridge regression is a slightly modified version

of linear regression which is more stable for collinear data.

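Let us first develop the theory behind linear regression: Assume you have a vector of responses $\mathbf{y} \in \mathbb{R}^{n}$, where $n$ is the number of samples. Let $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \dots, \mathbf{x}_p \in \mathbb{R}^{n}$ be our predictors, where $p$ is the number of predictors. Then our linear regression model is

$$\hat{\mathbf{y}} = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \beta_3 \mathbf{x}_3 + \dots + \beta_p \mathbf{x}_p$$

with $\beta_0$ being the intercept and $\beta_1, \dots, \beta_p$ being the slopes for the predictors. For convenience, we store our predictors in a matrix $\mathbf{X} \in \mathbb{R}^{n \times (p+1)} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \dots, \mathbf{x}_p, \mathbf{1}]$. In other words, each column of $\mathbf{X}$ represents one predictor. The last column consists entirely of ones, it represents the intercept. We also store all $\beta$'s in a vector $\boldsymbol{\beta} = [\beta_1, \beta_2, \dots, \beta_p, \beta_0] \in \mathbb{R}^{p+1}$. To calculate $\boldsymbol{\beta}$ we can use the equation

$$\boldsymbol{\beta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$

where $\mathbf{X}^{\top}$ is the matrix transpose of $\mathbf{X}$ and the superscript $(\cdot)^{-1}$ refers to the matrix inverse.

Unfortunately the inverse can be unstable or even undefined if $\mathbf{X}^{\top} \mathbf{X}$ is not well-conditioned. As a fix, we will use a different formula called ridge regression which adds a so-called regularization term $\alpha \mathbf{I}$.

$$\boldsymbol{\beta} = (\mathbf{X}^{\top} \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^{\top} \mathbf{y}$$

where $\mathbf{I} \in \mathbb{R}^{(p+1) \times (p+1)}$ is an identity matrix and $\alpha$ is a positive number that represents the regularization strength. The inverse then always exists as long as $\alpha > 0$. The parameter $\alpha$ has to be provided by the user.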

Your task: Write a function fit_ridge(y, X, a) that implements ridge regression as

defined above. It receives the following inputs:

- The response vector y is a numpy array with shape (n,1).

- The matrix X is a numpy array of predictors with shape (n, p). Note that X does not

contain the column of 1's, so you need to add it yourself.

- The input a represents the strength of regularization. a can be either a single number

(e.g. a = 1) or a list with multiple numbers (e.g. a = [1, 5, 10]).

If a is a single number, the function returns $\boldsymbol{\beta}$, the ridge regression coefficients using a for

the regularization. If a is a list with multiple numbers, separate ridge regression solutions

should be calculated for each value of a. In this case, the function returns a Python list of

vectors of regression coefficients $[\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \dots]$, where $\boldsymbol{\beta}_0$ is the vector of regression coefficients

using the first value of a, $\boldsymbol{\beta}_1$ is the vector of regression coefficients using the second value of a, and so

on.

Tip: remember that the * operator operates element-wise on Numpy arrays. If you want

proper matrix or vector multiplication like in linear algebra, you can use the @ operator.
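
The ridge formula can be sketched in Numpy as follows. The sketch appends the column of ones as the last column, so the intercept is the last coefficient, and it solves the linear system with np.linalg.solve instead of forming the inverse explicitly; this is mathematically the same formula and is a design choice, not a requirement of the task. The data at the end are made up for illustration.

```python
import numpy as np


def fit_ridge(y, X, a):
    """Ridge coefficients (slopes first, intercept last) for strength(s) a."""
    n = X.shape[0]
    # Append the column of ones so that the intercept is the last coefficient.
    X1 = np.hstack([X, np.ones((n, 1))])
    I = np.eye(X1.shape[1])

    def solve(alpha):
        # Solves (X^T X + alpha I) beta = X^T y for beta.
        return np.linalg.solve(X1.T @ X1 + alpha * I, X1.T @ y)

    if np.isscalar(a):
        return solve(a)
    return [solve(alpha) for alpha in a]


# Illustrative noise-free data: y = 2*x1 - x2 + 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([[2.0], [-1.0]]) + 3.0
print(fit_ridge(y, X, 1e-8).ravel())   # close to the true values 2, -1, 3
print(len(fit_ridge(y, X, [1, 5, 10])))  # one coefficient vector per value of a
```

As in the formula above, the regularization term is applied to all p+1 coefficients, including the intercept.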

Q2 c) Variable selection in linear regression (15 marks)

In this question, your task is to use statsmodels to implement two variable selection

functions for standard linear regression (a.k.a. OLS regression). The motivation is that

regression models can have dozens or even hundreds of predictors. This can make it difficult

to interpret the relationship between the predictors and the response variable y. Ideally, one

wants to identify a subset of the predictors that carries most of the information about y. A

possible approach is variable selection. In variable selection (‘variable’ means the same as

‘predictor’), variables get iteratively added or removed from the regression model. Once

finished, the model typically contains only a subset of the original variables.

In the following, we will call a predictor "significant" if the p-value of its coefficient is

smaller or equal to a given threshold. Your approach operates in two stages: In stage 1, you

iteratively remove predictors that are not significant. This leaves you with a subset of the

original predictors. In stage 2, you iteratively add interaction terms and keep them in the

model if they are significant. Remember what an interaction term is: if $x_1$ and $x_2$ are two

predictors, then the variable $z = x_1 \cdot x_2$ is their corresponding interaction term. We will split

the two stages into two functions:

Stage 1 (remove variables)

Write a function remove_variables(y, X, threshold = 0.05, variable_names =

None). The function receives the following inputs:

• y and X are numpy arrays like in Q2b).

• threshold is the cut-off value that determines whether a p-value is significant. If a p-value <= threshold, it counts as significant.

• variable_names is a Python list of variable names that a user can provide. These are the

names for the columns of X (e.g. ['TV', 'radio', 'newspaper'] for the advertisement

dataset discussed in the lecture). If no variable names are provided, your function

should create the variable names ['x1', 'x2', 'x3', ...] where 'x1' is the name for the

first column of X, 'x2' is the name for the second column of X, and so on.

The function returns a tuple (new_X, new_variable_names) containing two variables:

• new_X is the matrix of predictors after non-significant variables have been removed. It

should not include the column of 1’s corresponding to the intercept.

• new_variable_names is a list of strings containing the variable names for the

columns of new_X.

Use the statsmodels function add_constant to make sure that X contains a column of 1's for

the intercept, and use the intercept in all fits. The details of how to implement

stage 1 are as follows:

• To start, fit an OLS model using all of the predictors in X.

• Identify the predictor whose coefficient has the largest p-value. If it is not significant,

remove it and fit the model again.

• Repeat this process until either all predictors have been removed or all predictors left are

significant.

• Never remove the intercept irrespective of whether or not it is significant.

• If no predictors are left after stage 1, return the tuple (None, None).

Tip: It might be useful to use Boolean arrays to select subsets of columns of X.

Stage 2 (add interaction terms)

Write a function add_interaction_terms(y, X, threshold = 0.05, variable_names

= None). The inputs have the same meaning as in remove_variables. The function

returns a tuple (new_X, new_variable_names) containing two variables:

• new_X is the matrix of predictors after the interaction terms have been added. Hence,

it contains the predictors in X plus the interaction terms that have been added as new

columns to the right. It should not contain the column of 1’s corresponding to the

intercept term.

• new_variable_names is a list of strings containing the variable names for the

columns of new_X. For interaction terms, use names that combine the two variable

names with a ‘*’ sign. For instance, if you add the interaction term for ‘tv’ and

‘radio’, then call their interaction variable ‘tv*radio’.

The function implements the following algorithm:

• To start, fit an OLS model using all of the predictors in X.

• Test whether it is useful to add interaction terms: For each pair of predictors, add their

interaction term into the model. If the interaction term is significant, keep it in the model.

If it is not significant, remove it again.

• Continue this until you checked every pair of predictors.

• Never add an interaction term involving the intercept.

• It can happen that when you add new interaction terms, predictors that you previously

added become non-significant. You can ignore this issue.

• Add the interaction terms in order, starting from the leftmost predictor in X. For instance,

if you have predictors with column indices 1, 2, 3, and 4, you first add the [1, 2]

interaction term, then [1, 3], [1, 4], [2, 3], [2, 4], and finally [3, 4].

• After you checked the interaction terms for all pairs of predictors, you are finished.

Return the new set of predictors and variable names as defined above.

Finally, note that it should be possible to run both functions one after the other. For instance,

given y and X, the following two lines of code

(new_X,new_variable_names)=remove_variables(y, X)

(new_X,new_variable_names)=add_interaction_terms(y, new_X, variable_names=new_variable_names)

should first perform removal of variables and then add interaction terms.

As a starting point, use Q2.py from Learning Central. Do not rename the file or the function.

Question 3 – Ethics (Total 30 Marks)

In this question you will investigate bias in text corpora (document collections). You are

provided with two datasets from a recent data science competition on Hyperpartisan News

Detection [1]. These datasets are

- bias_corpus.txt: a corpus of news articles from media that have been

classified as exhibiting right or left political bias.

- nobias_corpus.txt: a corpus of news articles that have been classified

as neutral.

These newspaper articles are mostly written in the context of US politics. They could be used

for building targeted political ads (reader of newspaper X will prefer to see ads of party Y),

user or community profiling, etc. However, some articles may depict certain protected

communities (women, immigrants or LGBT) in a negative way. This may bias any data

science model built on top of this data.

In this question you implement a 'pattern matching' procedure for investigating how protected

communities are depicted in both corpora (biased vs non-biased). As an inspiration, you can

start experimenting with Hearst patterns [2], which are often used to identify word pairs in

which a type-of relationship holds. An example for a Hearst pattern is ‘X is a type of Y’.

The slots X and Y will be filled with matches in corpora, e.g., ‘cat is a type of animal’

or ‘sofa is a type of furniture’. Such patterns can also be used to reveal how certain

communities are depicted. For example, ‘immigrants and other x’ would reveal how

immigrants are depicted in these media. A neutral example could be ‘immigrants and

other communities’, whereas a (negatively) biased example could be ‘immigrants and

other criminals’. An initial list of patterns is provided below (with actual examples from

the data). However, you are free and encouraged to experiment with text patterns of

your own design. You can experiment using only X, only Y, or both X and Y as empty slots

(regex groups).

Pattern                         Example occurrence
X is a Y                        Obama is a citizen
X is Y                          Trump is threatening
X and other Y                   Refugees and other criminals
marginalized Y, especially X    marginalized groups, especially gays
X works as a Y                  He works as a manager
                                She works as a hairdresser

Your tasks:

• Download and uncompress the text corpora from this url:

https://drive.google.com/drive/folders/1ATp_zALwRRG5-rd9o0WEcP9IXKOGADSd?usp=sharing

• Decide on a person or community of interest. Define a pattern which you hypothesize

is likely to reveal how this person/community is depicted. This pattern could include

regular expressions and group matching for a slot x to fill. For example, the pattern

‘Trump is x’ is likely to match more verbs in non-biased media because they talk

more about what he does (‘Trump is speaking’ or ‘Trump is attending’). In

biased media, however, we could find more adjectives (‘Trump is arrogant’ or

‘Trump is great’).

• Retrieve and count the hits you get for each value of x in the biased and the

non-biased corpora separately and store those in the dbias and dnobias dictionaries. For

example, if ‘Trump is speaking’ occurs twice in the non-biased corpus, then the

dictionary entry dnobias['speaking'] has the value 2.

• Do this pattern extraction process for three different persons/communities to obtain a

total of three case studies. You can use the same or different patterns.
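
The counting step can be sketched with a regular expression that has one group for the slot x. The helper name count_matches, the exact pattern and the two short strings are hypothetical stand-ins chosen for illustration; in practice the texts would be read from bias_corpus.txt and nobias_corpus.txt. Lower-casing the matches is also an assumption of this sketch.

```python
import re
from collections import Counter


def count_matches(pattern, text):
    """Count how often each value of the slot x (regex group 1) is matched."""
    # With exactly one group, re.findall returns the group contents.
    return Counter(m.lower() for m in re.findall(pattern, text))


# Hypothetical pattern for 'Trump is x'; \b anchors the match at a word boundary.
pattern = r"\bTrump is (\w+)"

# Tiny stand-in strings instead of the real corpora.
bias_text = "Trump is arrogant. Some say Trump is great. Trump is arrogant."
nobias_text = "Trump is speaking today. Trump is attending. Trump is speaking."

dbias = dict(count_matches(pattern, bias_text))
dnobias = dict(count_matches(pattern, nobias_text))
print(dnobias["speaking"])  # 2
```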

Then, report your results in the template document Q3.docx as follows: For each of the three

case studies, write a short justification (up to 300 words for each) with

1. Your initial hypothesis, why you chose the pattern and no other, how many patterns

you tried for the case you wanted to test, etc.

2. Provide the comparison match frequency table for the two corpora (see example,

provided as a comment, at the end of Q3.py).

3. Discuss differences, if any, between the results obtained from the two corpora, and

highlight the stereotypical or negative depictions you found.

As a starting point, use Q3.py and Q3.docx from Learning Central. Your submission should

only include the Word document, not the Python script.

References:

[1] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., ... & Potthast, M. (2019,

June). Semeval-2019 Task 4: Hyperpartisan news detection. In Proceedings of the 13th

International Workshop on Semantic Evaluation (pp. 829-839).

(Available at https://www.aclweb.org/anthology/S19-2145/)

[2] Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora.

In Proceedings of the 14th conference on Computational Linguistics-Volume 2 (pp. 539-545).

Association for Computational Linguistics.

(Available at https://www.aclweb.org/anthology/C92-2082/)

Learning Outcomes Assessed

• Carry out data analysis and statistical testing using code

• Critically analyse and discuss methods of data collection, management and storage

• Reflect upon the legal, ethical and social issues relating to data science and its

applications

Criteria for assessment

Credit will be awarded against the following criteria. The score in each implemented function

is judged by its functionality. For Q1 and Q2, the functions you have implemented will be

tested against different data sets to judge their functionality. Additionally, quality and

efficiency (Q1) will be assessed. For Q3, marks are based on the written report. The below

table explains the criteria.

Criteria Distinction

Feedback on your coursework will address the above criteria. Feedback and marks will be

returned within 4 weeks of your submission date via Learning Central. In case you require

further details, you are welcome to schedule a one-to-one meeting. Feedback from this

assignment will be useful for next year’s version of this module as well as the Python for Data

Analysis module.
