How Useful are Your Comments?
Analyzing and Predicting YouTube Comments
and Comment Ratings
Stefan Siersdorfer, Sergiu Chelaru,
Wolfgang Nejdl
L3S Research Center
Appelstr. 9a
30167 Hannover, Germany
{siersdorfer, chelaru, nejdl}@L3S.de
Jose San Pedro
Telefonica Research
Via Augusta, 171
Barcelona 08021, Spain
ABSTRACT
An analysis of the social video sharing platform YouTube
reveals a high amount of community feedback through com-
ments for published videos as well as through meta ratings
for these comments. In this paper, we present an in-depth
study of commenting and comment rating behavior on a
sample of more than 6 million comments on 67,000 YouTube
videos for which we analyzed dependencies between com-
ments, views, comment ratings and topic categories. In
addition, we studied the influence of sentiment expressed
in comments on the ratings for these comments using the
SentiWordNet thesaurus, a lexical WordNet-based resource
containing sentiment annotations. Finally, to predict com-
munity acceptance for comments not yet rated, we built dif-
ferent classifiers for the estimation of ratings for these com-
ments. The results of our large-scale evaluations are promis-
ing and indicate that community feedback on already rated
comments can help to filter new unrated comments or sug-
gest particularly useful but still unrated comments.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Algorithms, Experimentation, Measurement
Keywords
comment ratings, community feedback, youtube
1. INTRODUCTION
The rapidly increasing popularity and data volume of modern Web 2.0 content sharing applications is based on their ease of operation even for inexperienced users, suitable mechanisms for supporting collaboration, and the attractiveness of shared annotated material (images in Flickr, bookmarks in del.icio.us, etc.). For video sharing, the most popular site is
YouTube (http://www.youtube.com). Recent studies have shown that traffic to/from
this site accounts for over 20% of the web total and 10% of the whole internet [3], and comprises 60% of the videos watched on-line [11].

Figure 1: Comments and Comment Ratings in YouTube
YouTube provides several social tools for community interaction, including the possibility to comment on published videos and, in addition, the possibility for other users to rate these comments (see Figure 1). These meta ratings serve the purpose of helping the community to filter relevant opinions more efficiently. Furthermore, because negative votes are also available, comments with offensive or inappropriate content can be easily skipped.
The analysis of comments and associated ratings constitutes a potentially interesting data source to mine for obtaining implicit knowledge about users, videos, categories and community interests. In this paper, we conduct a study of this information with several complementary goals. On the one hand, we study the viability of using comments and community feedback to train classification models for deciding on the likely community acceptance of new comments. Such models have direct application to the enhancement of comment browsing, by promoting interesting comments even in the absence of community feedback. On the other hand, we perform an in-depth analysis of the distribution of comment ratings, including qualitative and quantitative studies about
sentiment values of terms and differences across categories. Can we predict the community feedback for comments? Is there a connection between sentiment and comment ratings? Can comment ratings be an indicator for polarizing content? Do comment ratings and sentiment depend on the topic of the discussed content? These are some of the questions we investigate in this paper by analyzing a large sample of comments from YouTube.
Clearly, due to the continuing and increasing stream of comments in social sharing environments such as YouTube, the community is able to read and rate just a fraction of these. The methods we present in this paper can help to automatically structure and filter comments. Analyzing the ratings of comments for videos can provide indicators for highly polarizing content; users of the system could be provided with different views on that content using comment clustering and aggregation techniques. Furthermore, automatically generated content ratings might help to identify users showing malicious behavior such as spammers and trolls at an early stage, and, in the future, might lead to methods for recommending to an individual user of the system other users with similar interests and points of view.
The rest of this paper is organized as follows: In Section 2 we discuss related work on user generated content, product reviews and comment analysis. Section 3 describes our data gathering process, as well as the characteristics of our dataset. In Section 4 we analyze the connection between sentiment in comments and community ratings using the SentiWordNet thesaurus. We then provide a short overview of classification techniques in Section 5, explain how we can apply these techniques to rate comments, and provide the results of large-scale classification experiments on our YouTube data set. In Section 6 we analyze the correspondence between comment ratings and polarizing content through user experiments. Section 7 describes dependencies of ratings and sentiments on topic categories. We conclude and show directions for future work in Section 8.
2. RELATED WORK
There is a body of work on analyzing product reviews and postings in forums. In [4] the dependency of the helpfulness of product reviews from Amazon users on the overall star rating of the product is examined, and a possible explanation model is provided. “Helpfulness” in that context is defined by Amazon’s notion of how many users rated a review and how many of them found it helpful. Lu et al. [17] use a latent topic approach to extract rated quality aspects (corresponding to concepts such as “price” or “shipping”) from comments on eBay. In [27] the temporal development of product ratings and their helpfulness, and dependencies on factors such as the number of reviews or the effort required (writing a review vs. just assigning a rating), are studied. The helpfulness of answers on the Yahoo! Answers site and the influence of variables such as the required type of answer (e.g. factual, opinion, personal advice), the topic domain of the question, or prior effort (e.g. did the inquirer do some a priori research on the topic?) is manually analyzed in [12]. In comparison, our paper focuses on community ratings for comments and discussions rather than product ratings.
Work on sentiment classification and opinion mining such as [19, 25] deals with the problem of automatically assigning opinion values (e.g. “positive” vs. “negative” vs. “neutral”) to documents or topics using various text-oriented and linguistic features. Recent work in this area also makes use of SentiWordNet [5] to improve classification performance. However, the problem setting in these papers differs from ours, as we analyze community feedback for comments rather than trying to predict the sentiment of the comments themselves.
There is a plethora of work on classification using proba-
bilistic and discriminative models [2] and learning regression
and ranking functions [24, 20, 1]. The popular SVM Light
software package [14] provides various kinds of parameteri-
zations and variations of SVM training (e.g., binary classi-
fication, SVM regression and ranking, transductive SVMs,
etc.). In this paper we apply these techniques in a novel context: the automatic classification of comment acceptance.
Kim et al. [15] rank product reviews according to their helpfulness using different textual features and meta data. However, they report their best results for a combination of information obtained from the star ratings (e.g. deviation from other ratings) provided by the authors of the reviews themselves; this information is not available for all sites, and in particular not for comments in YouTube. Weimer et al. [26] make use of a similar idea to automatically predict the quality of posts in the software online forum Nabble.com. Liu et al. [16] describe an approach for aggregation of ratings on product features using helpfulness classifiers based on a manually determined ground truth, and compare their summarization with special “editor reviews” on these sites. Another example of using community feedback to obtain training data and ground truth for classification and regression can be found in our own work [22], for an entirely different domain, where tags and visual features in combination with favorite assignments in Flickr are used to classify and rank photos according to their attractiveness.
Compared to previous work, our paper is the first to apply and evaluate automatic classification methods for comment acceptance in YouTube. Furthermore, we are the first to provide an in-depth analysis of the distribution of YouTube comment ratings, including both qualitative and quantitative studies as well as dependencies on comment sentiment, rating differences between categories, and polarizing content.
3. DATA
We created our test collection by formulating queries and
subsequent searches for “related videos”, analogously to the
typical user interaction with the YouTube system. Given
that an archive of the most common queries does not exist for YouTube, we selected our set of queries from Google’s Zeit-
geist archive from 2001 to 2007, similarly to our previous
work [23]. These are generic queries, used to search for web
pages. In this way, we obtained 756 keyword queries.
In 2009, for each video we gathered the first 500 comments (if available), along with their authors, timestamps and comment ratings. YouTube computes comment ratings by counting the number of “thumbs up” or “thumbs down” ratings, which correspond to positive or negative votes by other users. In addition, for each video we collected meta data such as title, tags, category, description, and upload date, as well as statistics provided by YouTube such as the overall number of comments, views, and the star rating for the video. The complete collection used for evaluation had a final size of 67,290 videos and about 6.1 million comments.
Figure 2: Distribution of Number of Comments per Video
Figure 3: Distribution of comment ratings
Figure 2 shows the distribution of the number of comments per video in the collected set. The distribution follows the expected Zipfian pattern, valid for most community-provided data: most of the energy is contained within the first-ranked elements, followed by a long tail of additional low-represented elements. For our collection, we observe a mean value of µ_comm = 475 comments per video, with comment ratings ranging from -1,918 to 4,170 and a mean value of µ_r = 0.61.
Figure 3 shows the distribution of comment ratings. Two main observations can be made: On the one hand, the distribution is asymmetric for positive and negative ratings, indicating that the community tends to cast more positive than negative votes. On the other hand, comments with a rating of 0 represent about 50% of the overall population, indicating that most comments lack votes or are evaluated neutrally by the community.
Preliminary Term Analysis.
The textual content of comments in Web 2.0 infrastructures such as YouTube can provide clues on the community acceptance of comments. This is partly due to the choice of words and language used in different kinds of comments.
Table 1: Top-50 terms according to their MI values for accepted (i.e. high comment ratings) vs. not accepted (i.e. low comment ratings) comments
Terms for Accepted Comments
love favorit perfect wish sweet
song her perform hilari jame
best hot miss most talent
amaz my omg gorgeou feel
beauti d nice brilliant avril
awesom voic bless legend wond er
she rock music ador janet
thank lol sexi fantast danc
lt xd man heart absolut
cute luv greatest time watch
Terms for Unaccepted Comments
fuck ur game fuckin shut
suck dont fat worst gui
u ugli kill y im
gai dick idiot pussi jew
shit better dumb crap comment
stupid fag retard de die
bitch white bad cunt cock
ass fake know bore name
nigger black don loser asshol
hate faggot sorri look read
As an illustrative example, we computed a ranked list of terms from a set of 100,000 comments with a rating of 5 or higher (high community acceptance) and another set of the same size containing comments with a rating of -5 or lower (low community acceptance). For ranking the terms, we used the Mutual Information (MI) measure [18, 28] from information theory, which can be interpreted as a measure of how much the joint distribution of features X_i (terms in our case) deviates from a hypothetical distribution in which features and categories (“high community acceptance” and “low community acceptance”) are independent of each other.
Table 1 shows the top-50 stemmed terms extracted for each category. Obviously, many of the “accepted” comments contain terms expressing sympathy or commendation (love, fantast, greatest, perfect). “Unaccepted” comments, on the other hand, often contain swear words (retard, idiot) and negative adjectives (ugli, dumb); this indicates that offensive comments are, in general, not promoted by the community.
4. SENTIMENT ANALYSIS OF RATED
COMMENTS
Do comment language and sentiment have an influence on
comment ratings? In this section, we will make use of the
publicly available SentiWordNet thesaurus to study the con-
nection between sentiment scores obtained from SentiWord-
Net and the comment rating behavior of the community.
SentiWordNet [9] is a lexical resource built on top of WordNet. WordNet [10] is a thesaurus containing textual descriptions of terms and relationships between terms (examples are hypernyms: “car” is a subconcept of “vehicle”, or synonyms: “car” describes the same concept as “automobile”). WordNet distinguishes between different part-of-speech types (verb, noun, adjective, etc.). A synset in WordNet comprises all terms referring to the same concept (e.g. {car, automobile}).
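As a small illustration of these WordNet concepts (assuming NLTK's WordNet interface, which is not mentioned in the paper), synsets and hypernyms can be inspected as follows:

```python
# Illustration only, using NLTK's WordNet corpus reader (requires the
# 'wordnet' corpus to be downloaded); not part of the original study.
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]     # first synset for "car"
print(car.lemma_names())       # terms in the synset, e.g. 'car', 'automobile'
print(car.hypernyms())         # more general concepts, e.g. motor vehicle
```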
Figure 4: SentiValue histograms (positivity and negativity) for term lists according to MI, comparing terms corresponding to negatively and positively rated comments
In SentiWordNet, a triple of senti values (pos, neg, obj) (corresponding to the positive, negative, or rather neutral sentiment flavor of a word, respectively) is assigned to each WordNet synset (and, thus, to each term in the synset). The sentivalues are in the range [0, 1] and sum up to 1 for each triple. For instance, (pos, neg, obj) = (0.875, 0.0, 0.125) for the term “good” or (0.25, 0.375, 0.375) for the term “ill”. Sentivalues were partly created by human assessors and partly assigned automatically using an ensemble of different classifiers (see [8] for an evaluation of these methods). In our experiments, we assign a sentivalue to each comment by computing the averages of pos, neg and obj over all words in the comment that have an entry in SentiWordNet.
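A minimal sketch of this per-comment averaging, assuming NLTK's SentiWordNet reader as the lookup mechanism (the paper does not specify the tooling), might look as follows; how to aggregate over a word's multiple synsets is also not specified, so we simply average over them:

```python
# Hedged sketch: per-comment sentivalues via NLTK's SentiWordNet interface
# (requires the 'sentiwordnet' and 'wordnet' corpora); not the original code.
from nltk.corpus import sentiwordnet as swn

def comment_sentivalues(tokens):
    """Average (pos, neg, obj) over all tokens with a SentiWordNet entry."""
    pos_sum = neg_sum = obj_sum = 0.0
    n = 0
    for tok in tokens:
        synsets = list(swn.senti_synsets(tok))
        if not synsets:
            continue                               # no SentiWordNet entry
        pos = sum(s.pos_score() for s in synsets) / len(synsets)
        neg = sum(s.neg_score() for s in synsets) / len(synsets)
        pos_sum += pos
        neg_sum += neg
        obj_sum += 1.0 - pos - neg                 # the three scores sum to 1
        n += 1
    if n == 0:
        return None                                # no sentiment information
    return (pos_sum / n, neg_sum / n, obj_sum / n)
```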
A SentiWordNet-based Analysis of Terms.
We want to provide a more quantitative study of the terms typically used in comments with high positive or negative ratings. To this end, we selected the top-2000 terms according to the MI measure (see the previous section) for positively and negatively rated comments, and retrieved their sentivalue triples (pos, neg, obj) from SentiWordNet where available.
Figure 4 shows the histograms of sentivalues for these terms. Compared to terms corresponding to positively rated comments, we can observe a clear tendency of the terms corresponding to negatively rated comments towards higher negative sentivalue assignments.
Figure 5: Distribution of comment negativity, objectivity and positivity sentivalues for the 5Neg, 0Dist and 5Pos partitions

Sentiment Analysis of Ratings.
We now describe our statistical comparison of the influence of sentiment scores on comment ratings. For our analysis, we restricted ourselves to adjectives, as we observed
the highest accuracy in SentiWordNet for these. Our intuition is that the choice of terms used to compose a comment may provoke strong reactions of approval or denial in the community, and therefore determine the final rating score. For instance, comments with a high proportion of offensive terms would tend to receive more negative ratings. We used comment-wise sentivalues, computed as explained above, to study the presence of sentiments in comments according to their rating.
To this end, we first subdivided the data set into three disjoint partitions:

5Neg: The set of comments with rating score r less than or equal to -5 (r ≤ -5).

0Dist: The set of comments with rating score equal to 0 (r = 0).

5Pos: The set of comments with rating score greater than or equal to 5 (r ≥ 5).
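As a minimal illustration of this partitioning (not taken from the paper), the assignment of a comment to one of the three partitions depends only on its rating score:

```python
# Sketch: map a comment's rating score to its partition (or None if the
# comment falls outside the three partitions used in the analysis).
def partition(rating):
    if rating <= -5:
        return "5Neg"
    if rating == 0:
        return "0Dist"
    if rating >= 5:
        return "5Pos"
    return None   # ratings in (-5, 0) or (0, 5) are not used here
```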
We then analyzed the dependent sentiment variables positive, objective and negative for each partition. Detailed comparison histograms for these sentiments are shown
in Figure 5.

Figure 6: Difference of mean values for sentiment categories (negativity and positivity) across the 5Neg, 0Dist and 5Pos partitions

These figures provide graphical evidence of the
intuition stated above. Negatively rated comments (5Neg) tend to contain more negative sentiment terms than positively rated comments (5Pos), reflected in a lower frequency of sentivalues at negativity level 0.0 along with consistently higher frequencies at negativity levels of 0.1 and above. Similarly, positively rated comments tend to contain more positive sentiment terms. We also observe that comments with a rating score equal to 0 (0Dist) have sentivalues in between, in line with the initial intuition.
We further analyzed whether the difference of sentivalues across partitions was significant. We considered comment positivity, objectivity and negativity as dependent variables. The rating partition (5Neg, 0Dist, 5Pos) was used as the independent variable (grouping factor) of our test. Let us denote by µ_s^k the mean value for sentiment s ∈ {N, O, P} (negativity, objectivity and positivity, respectively) for partition k ∈ {5Neg, 0Dist, 5Pos}. Our initial null hypothesis states that the distribution of sentiment values does not depend on the partition, i.e. the mean value of each dependent variable is equal across partitions: H_0: µ_s^5Neg = µ_s^0Dist = µ_s^5Pos. The alternative hypothesis H_a states that the difference is significant for at least two partitions. We then used three separate one-way ANOVA (Analysis of Variance) procedures [6], a statistical test of whether the means of several groups are all equal, to verify the null hypothesis H_0 for each variable: negativity (F_N), objectivity (F_O) and positivity (F_P).
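The paper does not name its statistics software; as a hedged sketch, the same one-way ANOVA can be run with SciPy, shown here for the negativity variable grouped by partition:

```python
# Sketch, assuming SciPy: one-way ANOVA testing whether mean negativity
# differs across the 5Neg, 0Dist and 5Pos partitions. The other two
# sentiment variables are tested in the same way.
from scipy import stats

def anova_negativity(neg_5neg, neg_0dist, neg_5pos, alpha=0.01):
    """Each argument: list of per-comment negativity values in one partition."""
    f_stat, p_value = stats.f_oneway(neg_5neg, neg_0dist, neg_5pos)
    reject_h0 = p_value < alpha   # reject equal means at significance level alpha
    return f_stat, p_value, reject_h0
```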
We selected a random sample of 15,000 comments. From this, we discarded comments for which sentiment values were unavailable in SentiWordNet, resulting in a final set of 5,047 comments. All tests resulted in a strong rejection of the null hypothesis H_0 at significance level 0.01. Figure 6 shows the difference of mean values for negativity and positivity, revealing that negative sentivalues are predominant in negatively rated comments, whereas positive sentivalues are predominant in positively rated comments.
The ANOVA test does not provide information about which specific mean values µ_s^k refuted H_0. Many different post-hoc tests exist to reveal this information. We used the Games-Howell test [6] to reveal these inter-partition mean differences because of its tolerance for standard deviation heterogeneity in data sets. For negativity, the following homogeneous groups were found: {{5Neg}, {0Dist, 5Pos}}. Finally, for positivity the following homogeneous groups were found: {{5Neg}, {0Dist}, {5Pos}}. These results provide statistical evidence for the intuition that negatively rated comments contain a significantly larger number of negative sentiment terms, and similarly for positively rated comments and positive sentiment terms.
5. PREDICTING COMMENT RATINGS
Can we predict community acceptance? We will use sup-
port vector machine classification and term-based represen-
tations of comments to automatically categorize comments
as likely to obtain a high overall rating or not. Results
of a systematic and large-scale evaluation on our YouTube
dataset show promising results, and demonstrate the viabil-
ity of our approach.
5.1 Experimental Setup for Classification
Our term- and SentiWordNet-based analysis in the previ-
ous sections indicates that a word-based approach for classi-
fication might result in good discriminative performance. In
order to classify comments into categories “accepted by the
commu nity” or “not accepted”, we use a supervised learning
paradigm which is based on training items (comments in
our case) that need t o be provided for each category. Both
training and test items, which are later given to the clas-
sifier, are represented as multi dimensional feature vectors.
These vectors can, for instance, be constructed using tf or
tf · idf weights which represent the importance of a term for
a document in a specific corpus. Comments labeled as “ac-
cepted” or “not accepted” are used to t rain a classification
model, using probabilistic (e.g., Naive Bayes) or discrimina-
tive models (e.g., SVMs).
How can we obtain sufficiently large training sets of “accepted” and “not accepted” comments? We are aware that the concept is highly subjective and problematic. However, the amount of community feedback in YouTube results in large annotated comment sets, which can help to average out noise in various forms and, thus, reflects to a certain degree the “democratic” view of a community. To this end, we considered distinct thresholds for the minimum comment rating. Formally, we obtain a set {(c_1, l_1), ..., (c_n, l_n)} of comment vectors c_i labeled by l_i, with l_i = 1 if the rating lies above a threshold (“positive” examples) and l_i = -1 if the rating lies below a certain threshold (“negative” examples).
Linear support vector machines (SVMs) construct a hyperplane w·x + b = 0 that separates a set of positive training examples from a set of negative examples with maximum margin. For a new, previously unseen comment c, the SVM merely needs to test whether it lies on the “positive” or the “negative” side of the separating hyperplane. We used the SVMlight [14] implementation of linear SVMs with standard parameterization in our experiments, as this has been shown to perform well for various classification tasks (see, e.g., [7, 13]).
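A compact sketch of this setup, using scikit-learn's tf-idf vectorizer and linear SVM as stand-ins for the SVMlight pipeline described above (the exact feature weighting and parameters of the original experiments are not reproduced here):

```python
# Hedged sketch: term-vector comment classification with a linear SVM.
# Labels: +1 for "accepted", -1 for "not accepted" comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_acceptance_classifier(train_texts, train_labels):
    vectorizer = TfidfVectorizer()                  # tf-idf term feature vectors
    X_train = vectorizer.fit_transform(train_texts)
    clf = LinearSVC()                               # linear SVM, default parameters
    clf.fit(X_train, train_labels)
    return vectorizer, clf

def acceptance_scores(vectorizer, clf, texts):
    # Signed distance to the separating hyperplane; the sign decides the class,
    # the magnitude can be used for ranking comments by predicted acceptance.
    return clf.decision_function(vectorizer.transform(texts))
```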
We performed different series of binary classification experiments on YouTube comments with the classes “accepted” and “not accepted” as introduced in the previous subsection. For these experiments, we considered different levels of restrictiveness for the classes. Specifically, we considered distinct thresholds for the minimum and maximum ratings (above/below +2/-2, +5/-5 and +7/-7) for comments to be considered as “accepted” or “not accepted” by the community.
Figure 7: Comment Classification: Precision-recall curves for AC_POS, AC_NEG and THRES-0 (50,000 training comments per class, rating threshold 5)
Table 2: Comment Classification Results (BEPs)

AC_POS
T        Rating threshold 2   Rating threshold 5   Rating threshold 7
1000     0.6047               0.6279               0.6522
10000    0.642                0.6714               0.6932
50000    0.6616               0.6957               0.7208
200000   0.6753               -                    -

AC_NEG
T        Rating threshold 2   Rating threshold 5   Rating threshold 7
1000     0.6061               0.629                0.6531
10000    0.6431               0.6724               0.6943
50000    0.6627               0.6966               0.7215
200000   0.6763               -                    -

THRES-0
T        Rating threshold 2   Rating threshold 5   Rating threshold 7
1000     0.5516               0.5807               0.6014
10000    0.5812               0.6264               0.6424
50000    0.6003               0.6456               0.6639
200000   0.6106               0.6586               0.6786
We also considered different amounts of randomly chosen “accepted” training comments (T = 1000, 10000, 50000, 200000) as positive examples and the same amount of randomly chosen “unaccepted” comments as negative examples (where that number of training comments and at least 1000 test comments were available for each of the two classes). For testing the models based on these training sets, we used the disjoint sets of remaining “accepted” comments with the same minimum rating and a randomly selected disjoint subset of negative examples of the same size. We performed a similar experiment considering “unaccepted” comments as positive and “accepted” ones as negative, thus testing the recognition of “bad” comments. We also considered the scenario of discriminating comments with a high absolute rating (either positive or negative) against unrated comments (rating = 0). The three scenarios are labeled AC_POS, AC_NEG, and THRES-0, respectively.
5.2 Results and Conclusions
Our quality measures are the precision-recall curves as
well as the precision-recall break-even points (BEPs) for
these curves (i.e. precision/recall at the point where preci-
sion equals recall, which is also equal to the F1 measure, the
harmonic mean of precision and recall in that case). The re-
sults for the BEP values are shown in Table 2. The detailed
precision-recall curves for the example case of T=50000 train-
ing comments class and thresholds +5/-5 for “accepted”/
“unaccepted” comments are shown in Figure 7. The main
observations are:
A ll three types of classifiers provide good performance.
For instance, the configuration with T=50,000 posi-
tive/negative training comments and thresholds +7/-7
for the scenario AC POS leads to a BEP of 0.7208.
Consistently, similar observations can be made for all
examined configurations.
Trading recall against precision leads to applicable re-
sults. For instance, we obtain prec=0.8598 for re-
call=0.4, and prec=0.9319 for recall=0.1 for AC POS;
this is useful for finding candidates for interesting com-
ments in large comment sets.
Classification results tend to improve, as expected,
with an increasing number of training comments. Fur-
thermore, classification performance increases with
higher thresholds for community ratings for which a
comment is considered as “accepted”.
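For completeness, the break-even point used above can be read off a precision-recall curve as sketched below (assuming scikit-learn for the curve computation; this is an illustration, not the original evaluation code):

```python
# Sketch: precision-recall break-even point (BEP) from classifier scores,
# i.e. the value at the threshold where precision and recall are closest.
import numpy as np
from sklearn.metrics import precision_recall_curve

def break_even_point(y_true, scores):
    precision, recall, _ = precision_recall_curve(y_true, scores)
    idx = np.argmin(np.abs(precision - recall))
    return (precision[idx] + recall[idx]) / 2.0
```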
6. COMMENT RATINGS AND
POLARIZING YOUTUBE CONTENT
In this section, we study the relationship between comment ratings and polarizing content, more specifically tags/topics and videos. By “polarizing content” we mean content likely to trigger diverse opinions and sentiment, examples being content related to the war in Iraq or the presidential election, in contrast to rather “neutral” topics such as chemistry or physics. Intuitively, we expect a correspondence between diverging and intensive comment rating behavior and polarizing content in YouTube.
Variance of Comment Ratings as Indicator for Polarizing Videos.
In order to identify polarizing videos, we computed the variance of comment ratings for each video in our dataset. Figure 8 shows examples of videos with high versus low rating variance (in our specific examples, videos about an Iraqi girl stoned to death, Obama, and protests on Tiananmen Square, in contrast to videos about The Beatles, cartoons, and amateur music). To show the relation between comment ratings and polarizing videos, we conducted a user evaluation of the top- and bottom-50 videos sorted by their variance. These 100 videos were put into random order and evaluated by 5 users on a 3-point Likert scale (3: polarizing, 1: rather neutral, 2: in between).
Figure 8: Videos with high (upper row) versus low
variance (lower row) of comment ratings
The assessments of the different users were averaged for each video, and we computed the inter-rater agreement using the κ-measure [21], a statistical measure of agreement between individuals for qualitative ratings. The mean user rating for videos at the top of the list was 2.085, in contrast to a mean of 1.25 for videos at the bottom (inter-rater agreement κ = 0.42); this is quite a large difference on a scale from 1 to 3 and supports our hypothesis that polarizing videos tend to trigger more diverse comment rating behavior. A t-test confirmed the statistical significance of this result (t = 7.35, d.f. = 63, P < 0.000001).
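The per-video variance ranking underlying this experiment can be computed directly from the comment records; a minimal sketch (assuming a pandas DataFrame with 'video_id' and 'rating' columns, which is our assumption, not the paper's data format):

```python
# Sketch: variance of comment ratings per video as a polarization indicator;
# videos are then ranked by this variance (most polarizing first).
import pandas as pd

def rating_variance_per_video(comments: pd.DataFrame) -> pd.Series:
    """comments: DataFrame with columns 'video_id' and 'rating'."""
    return (comments.groupby("video_id")["rating"]
                    .var()
                    .sort_values(ascending=False))
```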
Variance of Comment Ratings as Indicator for Polarizing Topics.
We also studied the connection between comment ratings and video tags corresponding to polarizing topics. To this end, we selected all tags from our dataset occurring in at least 50 videos, resulting in 1,413 tags. For each tag we then computed the average variance of comment ratings over all videos labeled with this tag. Table 3 shows the top- and bottom-25 tags according to the average variance. We can clearly observe a tendency for tags of videos with higher variance to be associated with more polarizing topics such as presidential, islam, iraq, or hamas, whereas tags of videos with low variance correspond to rather neutral topics such as butter, daylight or snowboard. There are also less obvious cases: one example is the tag xbox with high rating variance, which might be due to polarizing gaming communities strongly favoring either the Xbox or other consoles such as the PS3; another example is f-18 with low rating variance, a fighter jet that might be discussed in YouTube under rather technical aspects (rather than in the context of wars). We quantitatively evaluated this tendency in a user experiment with 3 assessors, similar to the one described for videos, using the same 3-point Likert scale and presenting the tags to the assessors in random order. The mean user rating for tags in the top-100 of the list was 1.53, in contrast to a mean of 1.16 for tags in the bottom-100 (inter-rater agreement κ = 0.431), supporting our hypothesis that tags corresponding to polarizing topics tend to be connected to more diverse comment rating behavior. The statistical significance of this result was confirmed by a t-test (t = 4.86, d.f. = 132, P = 0.0000016).
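Analogously, the tag-level score is just the average of the per-video variances over all videos carrying a tag; a hedged sketch (the dictionaries video_tags and video_variance are assumed inputs, not structures from the paper):

```python
# Sketch: average per-video rating variance for each tag occurring in at
# least min_videos videos (50 in the experiment above).
from collections import defaultdict

def average_variance_per_tag(video_tags, video_variance, min_videos=50):
    """video_tags: video_id -> list of tags; video_variance: video_id -> variance."""
    per_tag = defaultdict(list)
    for vid, tags in video_tags.items():
        if vid in video_variance:
            for tag in tags:
                per_tag[tag].append(video_variance[vid])
    return {tag: sum(vs) / len(vs)
            for tag, vs in per_tag.items() if len(vs) >= min_videos}
```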
Table 3: Top and Bottom-25 tags according to the
variance of comment ratings for the corresponding
videos
High comment rating variance
presidential nomination muslim shakira islam
campaign station itunes grassroots nice
xbox barack efron zac iraq
3g kiss obama deals celebrities
jew space shark hamas kiedis
Low comment rating variance
betting turns puckett tmx tropical
skybus peanut defender f-18 vlog
butter chanukah form savings iditarod
lent daylight egan snowboard havanese
menorah casserole 1040a 1040ez booklet
7. CATEGORY DEPENDENCIES OF
RATINGS
Videos in YouTube belong to a variety of categories such as “News & Politics”, “Sports” or “Science”. Given that different categories attract different types of users, an interesting question is whether this results in different kinds of comments, discussions and feedback.
7.1 Classification
In order to study the influence of categories on classification behavior, we conducted an experimental series similar to the one described in Section 5. In the following paragraphs, we describe the results of classifying YouTube comments into the classes “accepted” and “not accepted” as introduced in Section 5.1. In each classification experiment we restricted the training and test sets to comments on videos from the same category. We used smaller training sets than in Section 5, as fewer comments were available per category than for the overall dataset.
Figure 9 shows the precision-recall curves as well as the break-even points (BEPs) for comment classification for the configuration T=10,000 training documents and thresholds +5/-5 for accepted/unaccepted comments. We observe that training and classifying on different categories leads to clear differences in classification results. While classifiers applied within the categories “Music” and “Entertainment” show comparable performance, the performance drops for “News & Politics”. This might be an indicator of more complex patterns and user relationships in that domain.
7.2 Analysis of comment ratings for different categories
In this section we analyze the distribution of comment ratings across different categories. Our intuition is that some topics are more prone to generate intense discussions than others. Differences of opinion will normally lead to an increasing number of comments and comment ratings, affecting the distribution.
Figure 10 shows the distribution of comment ratings for a set of selected categories from our subset. We observe several variations across the different categories. For instance, science videos present a majority of 0-scored comments, maybe due to the impartial nature of this category. Politics videos have significantly more negatively rated comments than any other category. Music videos, on the other hand, have a
clear majority of positively rated comments. Mean rating score values for all categories in our database are shown in Figure 11.

Figure 9: Classification Precision-Recall Curves for Multiple Categories (Entertainment, Music, Politics; rating threshold 5, T=10,000)

Figure 10: Distribution of comment ratings for different categories (Entertainment, Music, Politics, People, Science)
We further analyzed whether the rating score difference across categories was significant. We considered comment ratings as the dependent variable, and categories as the grouping factor. Let us denote by µ_r^i the mean rating score for category i. We wanted to refute the hypothesis H_0: µ_r^i = µ_r^j for all i, j (i.e. the mean comment rating is identical for all categories). Our alternative hypothesis H_a states that at least two categories, i and j, feature mean rating scores that are statistically different. We used one-way ANOVA to test the validity of the null hypothesis. For this experiment we considered the complete data set, excluding comments with a rating of 0 and comments without an assigned category, for a total of 2,539,142 comments. The test resulted in a strong rejection of the hypothesis H_0 at significance level 0.01, providing evidence that mean rating values across categories are statistically different.
A subsequent post-hoc Games-Howell test was conducted to study pair-wise differences between categories. Table 4 shows the homogeneous groups found. The table identifies the category “Music” as having significantly higher comment ratings than any other, and the categories “Autos & Vehicles”, “Gaming” and “Science” as having significantly lower comment ratings. While some categories are likely to be affected by the lack of comment ratings (“Science”), the significantly lower comment ratings in some categories like “Gaming”
might indicate that malign users (trolls, spammers, . . . ) are more dominant in these categories than in others.

Figure 11: Mean Rating Score per Comment for different Categories
7.3 Sentivalues in Categories
In Section 4 we provided statistical evidence of the dependency of comment ratings on sentivalues. In this section we extend the analysis to also consider categories, to check whether we can find a dependency of sentivalues on different categories, and to provide additional ground for the claims presented in Section 7.2.
Table 4: Homogeneous Groups by Mean Rating

Highest mean:  Music
Medium mean:   Pets & Animals, Comedy, Education, Entertainment, News & Politics, Nonprofits & Activism, Sports, People & Blogs, Shows, Travel & Events, Howto & Style
Lowest mean:   Autos & Vehicles, Gaming, Science
We proceeded similarly to Section 7.2. In this case, we considered the sentivalues negativity, objectivity and positivity as dependent variables, and categories as the grouping factor. We denote by µ_r^{N,i} the mean negativity value for category i; analogously, µ_r^{O,i} and µ_r^{P,i} denote the mean objectivity and positivity values for category i. We wanted to refute the hypothesis H_0: µ_r^{K,i} = µ_r^{K,j} for all i, j and K ∈ {N, O, P} (i.e. the mean sentivalues are identical for all categories). Our alternative hypothesis H_a states that at least two categories, i and j, feature mean values that are statistically different. We used three one-way ANOVA procedures to test the validity of the null hypothesis. For this experiment we considered the complete data set, excluding comments for which sentivalues were not available, for a total of 2,665,483 comments. The tests resulted in a strong rejection of the hypothesis H_0 at significance level 0.01 in all three cases, providing evidence that mean sentivalues across categories are statistically different. Figure 12 shows the mean negativity, objectivity and positivity values for the different categories.
The results are in agreement with the findings of Section 7.2 (Table 4 and Figure 11). For instance, music exhibits the lowest negativity sentivalue and the highest positivity sentivalue. Our interpretation of these results is that different categories tend to attract different kinds of users and generate more or less discussion as a function of the controversy of their topics. This clearly goes along with significantly different ratings and sentivalues of comments associated with videos. As a result, user-generated comments tend to differ widely across categories, and therefore the quality of classification models is affected (as illustrated in Section 7.1).
8. CONCLUSION AND FUTURE WORK
We conducted an in-depth analysis of YouTube comments to shed some light on different aspects of comment ratings on the YouTube video sharing platform. How does community feedback on comments depend on the language and sentiment expressed? Can we learn models for comments and predict comment ratings? Does comment rating behavior depend on topics and categories? Can comment ratings be an indicator for polarizing content? These are some of the questions we examined in this paper by analyzing a sample of more than 6 million YouTube comments and ratings. Large-scale studies using the SentiWordNet thesaurus and YouTube meta data revealed strong dependencies between the different kinds of sentiments expressed in comments, the comment ratings provided by the community, and the topic orientation of the discussed video content.
Figure 12: Distribution of comment sentivalues for different categories

In our classification experiments, we demonstrated that community feedback in
social sharing systems, in combination with term features in comments, can be used to automatically determine the community acceptance of comments. User experiments show that rating behavior can often be connected to polarizing topics and content.
Regarding future work, we plan to study temporal aspects, additional stylistic and linguistic features, relationships between users, and techniques for aggregating information obtained from comments and ratings. We think that temporal aspects such as the order and timestamps of comments and the upload dates of commented videos can have a strong influence on commenting behavior and comment ratings, and, in combination with other criteria, could help to increase the performance of rating predictors. More advanced linguistic and stylistic features of comment texts might also be useful to build better classification and clustering models. Finally, comments and ratings can lead to further insights on different types of users (helpful users, spammers, trolls, etc.) and on social relationships between users (friendship, rivalry, etc.). This could, for instance, be applied to identifying groups of users with similar interests and recommending contacts or groups to users in the system.
We think that the proposed techniques have direct applications to comment search. When searching for additional information in other users’ comments, automatically predicted comment ratings could be used as an additional ranking criterion for search results. In this connection, integration and user evaluation within a wider system context, encompassing additional complementary retrieval and mining methods, is of high practical importance.
9. ACKNOWLEDGEMENTS
This work was supported by the EU FP7 integration projects LivingKnowledge (Contract No. 231126) and GLOCAL (Contract No. 248984) and the Marie Curie IOF project “Mieson”.
10. REFERENCES
[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier,
M. Deeds, N. Hamilton, and G. Hullender. Learning to
rank using gradient descent. In ICML ’05: Proceedings
of the 22nd international conference on Machine
learning, pages 89–96, New York, NY, USA, 2005.
ACM.
[2] S. Chakrabarti. Mining the Web: Discovering
Knowledge from Hypertext Data. Morgan-Kauffman,
2002.
[3] X. Cheng, C. Dale, and J. Liu. Understanding the
characteristics of internet short video sharing:
Youtube as a case study. Technical Report
arXiv:0707.3670v1 [cs.NI], arXiv e-prints, 2007.
[4] C. Danescu-Niculescu-Mizil, G. Kossinets,
J. Kleinberg, and L. Lee. How opinions are received by
online communities: a case study on amazon.com
helpfulness votes. In WWW ’09: Proceedings of the
18th international conference on World wide web,
pages 141–150, New York, NY, USA, 2009. ACM.
[5] K. Denecke. Using sentiwordnet for multilingual
sentiment analysis. In Data Engineering Workshop
(ICDEW 2008), pages 507–512, 2008.
[6] J. L. Devore. Probability and Statistics for Engineering
and the Sciences. Thomson Brooks/Cole, 2004.
[7] S. Dumais, J. Platt, D. Heckerman, and M. Sahami.
Inductive learning algorithms and representations for
text categorization. In CIKM ’98: Proceedings of the
seventh international conference on Information and
knowledge management, pages 148–155, Bethesda,
Maryland, United States, 1998. ACM Press.
[8] A. Esuli. Automatic Generation of Lexical Resources
for Opinion Mining: Models, Algorithms and
Applications. PhD in Information Engineering, PhD
School “Leonardo da Vinci”, University of Pisa, 2008.
[9] A. Esuli and F. Sebastiani. Sentiwordnet: A publicly
available lexical resource for opinion mining. In
Proceedings of the 5th Conference on Language
Resources and Evaluation (LREC 2006), pages
417–422, 2006.
[10] C. Fellbaum, editor. WordNet: An Electronic Lexical
Database. MIT Press, Cambridge, MA, 1998.
[11] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. Youtube
traffic characterization: a view from the edge. In IMC
’07: Proceedings of the 7th ACM SIGCOMM
conference on Internet measurement, pages 15–28,
New York, NY, USA, 2007. ACM.
[12] F. M. Harper, D. Raban, S. Rafaeli, and J. A.
Konstan. Predictors of answer quality in online q&a
sites. In CHI ’08: Proceeding of the twenty-sixth
annual SIGCHI conference on Human factors in
computing systems, pages 865–874, New York, NY,
USA, 2008. ACM.
[13] T. Joachims. Text categorization with Support Vector
Machines: Learning with many relevant features.
ECML, 1998.
[14] T. Joachims. Making large-scale support vector
machine learning practical. Advances in kernel
methods: support vector learning, pages 169–184, 1999.
[15] S.-M. Kim, P. Pantel, T. Chklovski, and
M. Pennacchiotti. Automatically assessing review
helpfulness. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 423–430, Sydney, Australia, July
2006. Association for Computational Linguistics.
[16] J. Liu, Y. Cao, C.-Y. Lin, Y. Huang, and M. Zhou.
Low-quality product review detection in opinion
summarization. In Proceedings of the Joint Conference
on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning
(EMNLP-CoNLL), pages 334–342, 2007. Poster paper.
[17] Y. Lu, C. Zhai, and N. Sundaresan. Rated aspect
summarization of short comments. In WWW ’09:
Proceedings of the 18th international conference on
World wide web, pages 131–140, New York, NY, USA,
2009. ACM.
[18] C. Manning and H. Schuetze. Foundations of
Statistical Natural Language Processing. MIT Press,
1999.
[19] B. Pang and L. Lee. Thumbs up? sentiment
classification using machine learning techniques. In
Conference on Empirical Methods in Natural Language
Processing (EMNLP), Philadelphia, PA, USA, 2002.
[20] M. Richardson, A. Prakash, and E. Brill. Beyond
pagerank: machine learning for static ranking. In
WWW ’06: Proceedings of the 15th international
conference on World Wide Web, pages 707–715, New
York, NY, USA, 2006. ACM.
[21] A. Rosenberg and E. Binkowski. Augmenting the
kappa statistic to determine interannotator reliability
for multiply labeled data points. In HLT-NAACL ’04:
Proceedings of HLT-NAACL 2004: Short Papers on
XX, pages 77–80, Morristown, NJ, USA, 2004.
Association for Computational Linguistics.
[22] J. San Pedro and S. Siersdorfer. Ranking and
classifying attractiveness of photos in folksonomies. In
WWW ’09: Proceedings of the 18th international
conference on World wide web, pages 771–780, New
York, NY, USA, 2009. ACM.
[23] S. Siersdorfer, J. San Pedro, and M. Sanderson.
Automatic video tagging using content redundancy. In
SIGIR ’09: Proceedings of the 32nd international
ACM SIGIR conf erence on Research and development
in information retrieval, pages 395–402, New York,
NY, USA, 2009. ACM.
[24] A. J. Smola and B. Schölkopf. A tutorial on support
vector regression. Statistics and Computing,
14(3):199–222, 2004.
[25] M. Thomas, B. Pang, and L. Lee. Get out the vote:
Determining support or opposition from Congressional
floor-debate transcripts. In EMNLP ’06: Proceedings
of the ACL-02 conference on Empirical methods in
natural language processing, pages 327–335, 2006.
[26] M. Weimer, I. Gurevych, and M. Muehlhaeuser.
Automatically assessing the post quality in online
discussions on software. In Companion Volume of the
45th Annual Meeting of the Association for
Computational Linguistics (ACL), 2007.
[27] F. Wu and B. A. Huberman. How public opinion
forms. In Internet and Network Economics, 4th
International Workshop, WINE 2008, Shanghai,
China, pages 334–341, 2008.
[28] Y. Yang and J. O. Pedersen. A comparative study on
feature selection in text categorization. In ICML ’97:
Proceedings of the Fourteenth International Conference
on Machine Learning, pages 412–420, San Francisco,
CA, USA, 1997. Morgan Kaufmann Publishers Inc.